NCBI Toolkit & IBM DB2 – asn.h

December 17th, 2009

While working on code for my Influenza Sequence Mapping Project [url] I ran into a frustrating incompatibility between the IBM DB2 Embedded SQL C client [url] and the NCBI Toolbox [url].  Both of these APIs contain a header file named “asn.h.”  By chance I had been setting my preprocessor include path as:

export CPPFLAGS="-I/usr/include/ncbi -I/home/db2inst1/sqllib/include"

With the NCBI Toolbox listed first in the include path the asn.h from that API was being read.  In my user code this worked out fine since I wasn’t making use of the asn.h from the IBM DB2 client.  Unfortunately, while setting up a new sandbox for development I happened to switch the order of the directories:

export CPPFLAGS="-I/home/db2inst1/sqllib/include -I/usr/include/ncbi"

With this seemingly inconsequential change the code would no longer compile.  The simple statement #include <blast.h>, which worked in the first case, now led to compilation errors.  The header include chain in the NCBI Toolbox was now reading asn.h from IBM DB2 rather than the NCBI one that was needed.  Finding the cause of this error was difficult because the compiler did not report which asn.h was being included and I was working under the assumption that only one (the NCBI one) was available.  In fact I needed assistance from NCBI Support to realize that the IBM DB2 header was covering (shadowing) the NCBI header.

Now recognizing the problem, I realized that I might be inadvertently covering header files all over the place, and that in some cases one header silently standing in for another might introduce insidious bugs that are not detected at compilation.  Indeed the issue of non-unique header file names is documented by CERT in the “CERT C Secure Coding Standard [url]” as “PRE08-C. Guarantee that header file names are unique [url].”  Instances of this problem should be easily detectable by the preprocessor: it could search the whole include path and report a warning or error whenever an #include directive matches a file in more than one directory.  I have filed GCC Bug 42407 [url] as an enhancement request for such a feature.
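Until the preprocessor offers such a warning, the check can be approximated outside the compiler.  What follows is only a minimal sketch in Python, not the feature requested in the bug report: the function name find_shadowed_headers is my own invention, it looks only at the top level of each directory (so headers included with a subdirectory prefix are not covered), and it compares file names only, not contents.

```python
import os
from collections import defaultdict


def find_shadowed_headers(include_dirs, suffixes=(".h", ".hpp")):
    """Return {header name: [directories providing it]} for every header
    that appears in more than one directory.

    Directories are kept in the order given, which is assumed to be the
    order of the -I options, so the first entry is the one that wins.
    """
    providers = defaultdict(list)
    for directory in include_dirs:
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            continue  # skip missing or unreadable directories
        for name in names:
            path = os.path.join(directory, name)
            if name.endswith(suffixes) and os.path.isfile(path):
                providers[name].append(directory)
    return {name: dirs for name, dirs in providers.items() if len(dirs) > 1}


if __name__ == "__main__":
    # The two include directories from the CPPFLAGS setting above.
    search_path = ["/home/db2inst1/sqllib/include", "/usr/include/ncbi"]
    for name, dirs in sorted(find_shadowed_headers(search_path).items()):
        print(f"{name}: {dirs[0]} hides {', '.join(dirs[1:])}")
```

Run against the two directories above, a script like this would presumably have listed asn.h right away.  GCC's -H option, which prints the name of each header file as it is used, is another way to see which copy was actually picked up for a given compile.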


H1N1 Sequence and Publications Mapping

May 1st, 2009

The recent outbreak of H1N1 has increased researchers' attention on flu viruses. The strain of H1N1 originating in Mexico in April has already been sequenced by a number of different researchers and the sequences have been posted to the NCBI databases [GenBank, 2009]. An issue of interest to researchers is the set of similarities and differences between this new strain and versions of the flu virus that have been experienced and studied in the past. It is notable that swine flu has historically been rare in humans and the number of published studies is limited.  “Part of the problem is that swine flu infections of humans have rarely been documented. Virologist Christopher Olsen of the University of Wisconsin, Madison, School of Veterinary Medicine co-authored a study in 2007 that found only 50 cases in the biomedical literature dating back to 1958 [Cohen, 2009].”

The Large Graph Layout (LGL) algorithm was originally developed to show sequence homology between proteins across multiple organisms. LGL was used to produce a map of proteins [Adai, 2004]. Within the map co-location of sequences is shown to be correlated with similarities in function and evolution. It may be useful then to run LGL on the set of sequences in the NCBI databases for viruses to provide a map of the new H1N1 virus in functional and evolutionary context of other flu viruses.

There is some evidence that H1N1 may be related to the flus of 1918 and 1930 as reported by Juergen A. Richt of Kansas State University [Richt, 2009]. Richt’s work is scheduled to be published in the Journal of Virology this month. Olsen’s 2007 study of research on swine flu goes back to 1958. Plotting research papers on swine flu against the positions of those flu sequences in the map may be useful in helping researchers understand the coverage of research relative to the breadth of viral homology or dissimilarity.

The methodology to produce a homology map seems to be straightforward in theory but potentially time consuming in practice. The exercise could have utility in identifying bottlenecks that slow the quick profiling of new viral strains. First the virus sequences would need to be identified in the NCBI database. Then each subsequence of interest such as H1, H5, N1, etc. could be identified. All of the subsequences selected for inclusion in the map would then need to be compared against one another with BLAST. The continuous homology measure for each pair would need to be coded into a boolean indicating whether an edge should be instantiated between them or not. For this it seems reasonable to use the threshold used in the original LGL study. A graph would then be instantiated with nodes representing subsequences and edges representing BLAST homology measures above the threshold. At this point the data could be run through LGL to get a 2D or 3D layout for the map. This would serve as the base layer for additional annotation.
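The thresholding step is the easiest part to make concrete. The following is a rough sketch in Python under several assumptions of mine: the pairwise scores are already available as a dictionary, a higher score means greater homology, and the threshold value and sequence labels are placeholders rather than the values from the LGL study. The function names build_edges and to_edge_list_text are hypothetical, and the plain "node node" text layout is just one simple edge-list form, not a claim about what LGL requires.

```python
def build_edges(scores, threshold):
    """Turn pairwise similarity scores into an undirected edge list.

    scores maps (sequence_a, sequence_b) to a similarity score where a
    higher value means greater homology (an assumption of this sketch).
    A pair becomes an edge when its score is at or above the threshold.
    """
    edges = set()
    for (a, b), score in scores.items():
        if a != b and score >= threshold:
            # Sort the pair so (a, b) and (b, a) collapse to one edge.
            edges.add(tuple(sorted((a, b))))
    return sorted(edges)


def to_edge_list_text(edges):
    """Render edges as one 'node node' pair per line."""
    return "\n".join(f"{a} {b}" for a, b in edges)


if __name__ == "__main__":
    # Placeholder labels and scores, not real BLAST output.
    example_scores = {
        ("H1_seq1", "H1_seq2"): 0.92,
        ("H1_seq2", "H1_seq1"): 0.92,
        ("H1_seq1", "N1_seq1"): 0.10,
    }
    print(to_edge_list_text(build_edges(example_scores, threshold=0.5)))
```

If the comparison reports a measure where lower is better, such as an expectation value, the comparison in build_edges would need to be reversed.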

Correlating the published research with the sequences will require processing the meta-data or perhaps full text of the articles depending on how well the relations between the viruses studied and their NCBI indexes have already been established. Given the small number of swine flu publications the survey paper might be used to manually establish the relations. These could then be plotted as a layer in the map to show the intensity of research relative to the distribution of sequences.

With these two layers in place a researcher could then get a sense of how well studied the broad space of viral sequences actually is. This may be a useful perspective for researchers such as Richt who are evaluating the evolutionary development of H1N1, or for writers such as Cohen who are trying to place the new strain in the context of existing works.


Graph / Network Analytics

April 28th, 2009

Graph / network data structures are very useful.  A number of tools and APIs have been written to work with such structures.  Many of these include functions that can be run on the graphs.  A function to calculate node degree is an example.  Unfortunately most of the APIs require that the entire graph be loaded into memory before any such functions can be called.  This presents a significant limitation on the size of the graph that can be analyzed.  Examples include:

  • igraph: This looks to be a great C API with all sorts of useful functions.  Unfortunately as the documentation notes: “The rule of thumb is that if your graph fits into the physical memory then igraph can handle it.”
  • Boost Graph Library: As a Boost library BGL uses the C++ STL to represent graphs and includes a number of functions for analytics.  The STL containers it uses include the vector, list and set, which are by default if not by definition memory-only containers.
  • Prefuse: A Java API, Prefuse uses a TupleSet interface to vertices and edges.  These are by default implemented with tables in memory.
  • JUNG: Another Java API with a terrific set of algorithms for graph analytics.  Perhaps it would be useful to compare Prefuse and JUNG to see if the back-end storage mechanisms could be unified.

Relational databases are mature technologies that manage the streaming of data on disk through result sets in memory in a very fast and efficient manner.  One option for extending graph analytics to larger graphs would be to take advantage of relational databases for the storage rather than relying on containers in memory.  BGL appears to offer some potential for this.  The key would be to find a C++ container that both exposes the necessary interface for a BGL adjacency list and also uses a relational database for storage and retrieval.  The other APIs might be extended in this manner as well by swapping out the vertex (node) and edge lists with some dynamic container that uses a DBMS and an efficiently cached result set.  For Prefuse the trick would be to find a JDBC container that could be wrapped with a TupleSet interface.
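To illustrate the general idea, and not any of the APIs above, here is a small sketch in Python that keeps the edge list in a relational table and lets the database compute node degree, streaming the result one row at a time.  SQLite stands in for the DBMS, and the table name edge, the column names src and dst, and the function names are all my own assumptions.  The graph is treated as undirected.

```python
import sqlite3


def load_edges(conn, edges):
    """Store an iterable of (src, dst) pairs in a table named 'edge'."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edge (src TEXT NOT NULL, dst TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO edge (src, dst) VALUES (?, ?)", edges)
    conn.commit()


def iter_degrees(conn):
    """Yield (node, degree) pairs for an undirected graph.

    The aggregation runs inside the database; Python only sees the
    result set, one row at a time, rather than the whole graph.
    """
    query = """
        SELECT node, COUNT(*) AS degree
        FROM (SELECT src AS node FROM edge
              UNION ALL
              SELECT dst AS node FROM edge)
        GROUP BY node
        ORDER BY node
    """
    for node, degree in conn.execute(query):
        yield node, degree


if __name__ == "__main__":
    # ":memory:" keeps the demo self-contained; a file path would put
    # the graph on disk instead.
    conn = sqlite3.connect(":memory:")
    load_edges(conn, [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
    for node, degree in iter_degrees(conn):
        print(node, degree)
```

Node degree is the easy case because it maps directly onto an aggregate query.  Algorithms that walk the graph would need many more round trips, which is where the caching of result sets would start to matter.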


Visualizing the State of Ugi Reaction Experimentation

November 22nd, 2008

Reading through a sample of the UsefulChem Experiments / Reactions (http://usefulchem.wikispaces.com/All+Reactions) reveals the significance of the Ugi reaction.  “The Ugi reaction is a multi-component reaction in organic chemistry involving a ketone or aldehyde, an amine, an isocyanide and a carboxylic acid to form a bis-amide (http://en.wikipedia.org/wiki/Ugi_reaction).”  The “Applications” section of the Wikipedia page is also interesting as it describes the significance of the reaction in terms of variations of the input components.  This creates a complex combinatorics problem.  The space of possible input molecules may be large due to the range of molecules that can fall into the four input component categories.  The input molecules within a category should be structurally similar and they should therefore have similar functional characteristics, making it possible to use predictive models to describe the products of the reaction.  I am not a chemist but that is my naive understanding of the Ugi reaction.

A problem for the researcher then is to explore the combinatorial space with experimentation and then to relate the results to a theoretical understanding.  Given the size of the combinatorial space a challenge is to perform enough experiments to get good coverage of the space.  One view of the space could be defined in theoretical terms by identifying the range of all possible input molecules.  But what is the breadth of coverage that has already been achieved through experimentation?

UsefulChem Experiment 099 (http://usefulchem.wikispaces.com/Exp099) provides a good example of a Ugi reaction.  Here the links to ChemSpider provide enough information for a spider to infer that the document describes a Ugi reaction.  A spider could also identify the four input components.  The ketone or aldehyde component is benzaldehyde.  The amine is furfurylamine.  The isocyanide is tert-butyl isocyanide and the carboxylic acid is boc-glycine.  The algorithms that would allow such a spider to make these inferences are an opportunity for research.  In this best-case scenario where UsefulChem has linked to ChemSpider the ChemSpider record could be pulled via their web services layer.  The InChI string should hold enough structural information for an algorithm to verify that the molecules fall into the necessary input component categorizations.

This hypothetical spider could then collect reaction information from open-notebook entries, journal articles, patents and other reports.  In the cases where a Ugi reaction and its component molecules can be identified they might be cross-referenced with ChemSpider to collect property information.  Such a dataset could be used to create a map of the coverage of the Ugi reaction space.  The map could be annotated with references to the source reports that describe the reaction.
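As a first rough number, before any map is drawn, coverage could be expressed as the share of possible four-component combinations that have a documented reaction.  The Python sketch below assumes the spider has already produced a set of candidate molecules per input category and a list of reactions; the function name ugi_coverage, the role labels and the extra molecule names (aldehyde_2 and so on) are placeholders of mine, and only the Exp099 components come from the text above.

```python
from math import prod

ROLES = ("carbonyl", "amine", "isocyanide", "acid")


def ugi_coverage(candidates, performed):
    """Return (distinct reactions, size of the space, fraction covered).

    candidates maps each role in ROLES to a set of molecule names.
    performed is an iterable of 4-tuples in ROLES order.  Reactions that
    use a molecule outside the candidate sets are ignored, and repeated
    reports of the same combination are counted once.
    """
    total = prod(len(candidates[role]) for role in ROLES)
    distinct = {
        reaction
        for reaction in performed
        if all(m in candidates[role] for role, m in zip(ROLES, reaction))
    }
    fraction = len(distinct) / total if total else 0.0
    return len(distinct), total, fraction


if __name__ == "__main__":
    candidates = {
        "carbonyl": {"benzaldehyde", "aldehyde_2"},      # aldehyde_2 is a placeholder
        "amine": {"furfurylamine", "amine_2"},           # amine_2 is a placeholder
        "isocyanide": {"tert-butyl isocyanide"},
        "acid": {"boc-glycine", "acid_2"},               # acid_2 is a placeholder
    }
    exp099 = ("benzaldehyde", "furfurylamine", "tert-butyl isocyanide", "boc-glycine")
    print(ugi_coverage(candidates, [exp099]))
```

A single fraction hides where in the space the experiments cluster, which is why a map is the more interesting product.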

There are many possibilities for how such a map could be drawn.  One technique might be to create four scatter plots, one for each set of the input components.  A dimensionality reduction technique such as multi-dimensional scaling might be performed on the property information for each component set to get it down to x and y coordinates.  Rendering all four scatter plots in a three-dimensional space would allow for lines to be drawn connecting the points across the plots.  These lines would represent individual Ugi reactions that have been performed and documented.  An example of this kind of visualization is the “3D Parallel Coordinates” view as described at (http://bdtnp.lbl.gov/Fly-Net/content/bid/pcx/ParallelCoordinates/ParallelCoordinates.html).
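The geometry of that view is simple enough to sketch.  The Python below assumes the dimensionality reduction has already been done, so each molecule has an (x, y) position within its own component plot; it stacks the four plots as parallel planes along z and returns one four-point line per reaction.  The names reaction_polylines and PLOT_ORDER, the spacing parameter and the coordinates in the example are all made up for illustration.

```python
PLOT_ORDER = ("carbonyl", "amine", "isocyanide", "acid")


def reaction_polylines(coords, reactions, spacing=1.0):
    """Build one 3D polyline per reaction.

    coords maps each role in PLOT_ORDER to {molecule: (x, y)}, the
    already-reduced 2D position of the molecule within that role's
    scatter plot.  Each plot is placed on its own plane at
    z = index * spacing, and a reaction becomes the line through its
    four molecules, one point per plane.
    """
    lines = []
    for reaction in reactions:
        points = []
        for index, (role, molecule) in enumerate(zip(PLOT_ORDER, reaction)):
            x, y = coords[role][molecule]
            points.append((x, y, index * spacing))
        lines.append(points)
    return lines


if __name__ == "__main__":
    # Invented coordinates standing in for the output of the scaling step.
    coords = {
        "carbonyl": {"benzaldehyde": (0.1, 0.4)},
        "amine": {"furfurylamine": (0.7, 0.2)},
        "isocyanide": {"tert-butyl isocyanide": (0.3, 0.9)},
        "acid": {"boc-glycine": (0.5, 0.5)},
    }
    exp099 = ("benzaldehyde", "furfurylamine", "tert-butyl isocyanide", "boc-glycine")
    for line in reaction_polylines(coords, [exp099], spacing=2.0):
        print(line)
```

The resulting point lists could be handed to whatever 3D plotting tool is at hand.  The order of the planes is arbitrary here and would probably affect how readable the crossing lines are.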

A 3D Parallel Coordinates View, or similar such visualization, of the state of Ugi reaction experiments could be a useful tool for researchers.  The visualization would allow a researcher to situate their experiment in the context of existing work.  This would provide a unique perspective that could not be obtained by simple keyword search or other existing literature and web search techniques.  It would also provide insight to a modeler who was trying to refine theoretical chemical combinatorics models with experimental results.  Additionally it might help to highlight outliers and other unexpected results that should be attended to.


Research Projects

November 18th, 2008

Below is a list of some of the research ideas I am currently working on.  These ideas all need further refinement to get more specific research questions and to build out the theoretical orientations, methodologies and evaluation strategies that will be used.  If these are subjects that interest you please let me know.

An Information Theoretic Framework for Visual Analytic Reasoning

This project is concerned with analyzing the structural properties of associative networks that are related to knowledge structures.  The knowledge structures may be concept maps representing the conceptualizations of an individual about a domain or they may be aggregate networks that describe conceptualizations across multiple actors within a domain.  The associative networks are not limited to the concept maps, but are used to aggregate heterogeneous data into a single integrated representation.  This representation is then further refined by relating the higher level concepts with their supporting data.  There are two primary research questions being pursued under this project:

  1. How can temporal patterns be distinguished from other structural patterns?
  2. Can information metrics be expanded to integrate latent semantics of the information with the structural and temporal properties?

Answers to these questions will likely take the form of measures and algorithms that relate patterns in such structures with sense-making and analytical reasoning processes.

Post-hoc Analysis of the VAST 2008 Challenge

Participation in the VAST 2008 Challenge was a rich experience that provided a lot of data from a longitudinal, purposeful application of Visual Analytics tools.  With the event now over the focus can shift from answering the questions of the Challenge itself to a more reflective posture of analyzing the process that was used.  It is hoped that a post-hoc analysis will help to identify opportunities for future research.  It is also expected that opportunities for generalizing the practice to other domains will be found.  Based on [ Liu, Z.; Nersessian, N. J. & Stasko, J. T., Distributed Cognition as a Theoretical Framework for Information Visualization, IEEE Transactions on Visualization and Computer Graphics, 2008, 14, 1173-1180 ] it appears that the distributed cognition framework will provide a useful perspective for analysis of the results.  The primary research question here is:

  1. What can a post-hoc analysis of the VAST 2008 Challenge teach us about the role of Visual Analytics in Distributed Cognition?

Knowledge Structure in Experimental Chemistry

Open notebook science (http://en.wikipedia.org/wiki/Open_notebook_science) offers a new and exciting source of data that has the potential to tell us a lot about how science is done.  Bibliometric research has been very productive for the study of knowledge domains.  Bibliometricians use formal research publications and their citations as the unit of analysis.  The act of citing a work is a behavioral indicator that hints at the intentions of the author.  With open notebook science the digital laboratory notebook record is now available as a unit of analysis.  Exploratory research in this area can help us answer the following questions:

  1. Do open notebook entries include new behavioral indicators that can be useful for analyzing knowledge structures?
  2. How can information science and systems take advantage of open notebook entries to support the hypothesis formulation and discovery processes in Chemistry?

All three of these projects tie together.  The data from a “Post-hoc Analysis of the VAST 2008 Challenge” and from “Knowledge Structure in Experimental Chemistry” might serve to inform the development of a robust “Information Theoretic Framework for Visual Analytic Reasoning.”  The Information Theoretic Framework might be combined with the Distributed Cognition Framework so that we have a way of studying how knowledge structures develop and change over time.


The Social Brain

November 13th, 2008

I had the opportunity to attend a great talk by Professor Clive Gamble at the University of Pennsylvania Museum of Archeology and Anthropology:

“Breaking the Mind Barrier: The Archeology and Evolution of Our Social Brain” with Professor Clive Gamble, Co-Director British Academy Centenary Project, Thursday November 13, 2008 6:00 PM, The University of Pennsylvania Museum of Archeology and Anthropology, http://www.museum.upenn.edu/gamble

Ideas I collected from the presentation
Restated (maybe misstated) by me, not quotations by the presenter.

Socialization mediated by tools / technology.  Emotion as a basis for social cohesiveness.  Getting group size to 150 required language.  Developed along with increase in brain size.  Childhood development: 3-year-olds think that other minds are thinking the same as theirs.  Five-year-olds recognize the existence of other minds that think differently.  Understanding the existence of other minds that think differently is necessary for the development of empathy, guilt and next order emotions.  These emotions are what create group cohesiveness.  Therefore in order to have a group size of 150 the mind must be developed to recognize the existence of other minds that think differently.  This is a higher order development above self-awareness.

  1. Self-Awareness
  2. Three-year-old children who recognize the existence of another mind.  (This is why 3-year-old children can’t lie.)
  3. Five-year-old children who recognize the existence of another mind that thinks differently than their own.

My Thoughts

One consequence of this evolutionary development is the importance of the affective aspects of social computing.  Technology mediated socialization is based on the emotions that hold people together.  MySpace and Facebook have obvious affective components for maintaining cohesiveness of a group.  This is particularly evident in adolescents’ use of these systems for socialization.  Complaints from friends that I should stop posting work stuff and only post fun stuff to Facebook are consistent with this.  In posting work-related content I am inconsistent with the more affective kinds of bonds that form around personal content.  How can this inform the design of collaborative support for open-notebook science?

Chemistry sites are potentially less emotive than Facebook style collaboration for users who treat them as pure reference systems, yet they are potentially more emotive than Facebook for users who are moving the field forward.  To make scientific collaboration successful it may be necessary to engage users on a personal level in debates and opinions.

Visualization and knowledge maps can support this kind of engagement with the content by helping to make the debates more explicit.  This will allow users to situate themselves within the group’s emotional structure rather than simply the hierarchical or relation-clustering structures.  Ultimately this can lead to users having stronger feelings about their contributions in the group.  How does the limit of 150 relationships fit into this?  It was interesting to see this number 150 come up three days in a row: Tuesday while reading the November issue of Communications of the ACM, Wednesday during the meeting with Professor Bradley, Thursday on the slides during this presentation.  It is probably worth digging deeper to see if this number is being used correctly or if it has taken on a life of its own in scientific discourse.
