|
WebContentment Web Usability & Data Content Services The Trading Company of Peter Millington |
||||||||||||
|
Merging Diverse DataThe ProblemAll pharmaceutical discovery laboratories acquire compounds from outside organisations, in order to increase the diversity of their compound collection. Files of structures and properties are exchanged which may contain anything from a few hundred to tens of thousands of compounds. There is a de facto standard file format for exchanging chemical data - called an SD File. The problem is that the standard is very flexible:
Solution 1 - Building a Holding DatabaseIt was decided that offered compounds would be added to a holding database, where they could be checked for uniqueness and suitable properties. Ultimately, data could be copied from here into the corporate database for purchased compounds. The corporate database therefore determined which standards should be used. Thus, for instance, as the corporate standard was for parents-only structures new compounds were run through a program that identified and removed salt moieties from the structures and placed the salt name in a text field. Thorough analysis of a large selection of SD Files that were on hand from diverse suppliers revealed that half a dozen fields occurred frequently - e.g. salt, purity, melting point, weight available, etc. These were worth defining in the holding database as specific fields, with a couple of general "Other Data" fields for any additional that might be provided with a particular file. All fields were defined as text fields, to allow for varying formats, and conversion was simply a case of changing field names. Solution 2 - Bulk Registration from SD FilesThe second approach was to set up a system that could process the data in any SD File and register the compounds in the corporate database. There were no constraints on the SD File. It could be provided by the compound supplier top go with a consignment of samples. Alternatively, it be could be generated from the Intermediate database just described, using a list of compound IDs provided by the supplier. The files were pre-processed using a Visual Basic application that was custom-built to our specification by a software consultancy. In principle, this separated the structures from the non-structural data, creating:
The VB application identified what fields were present in the SD File, and displayed them as a table, with examples of the data. Fields were then matched with the fields of the corporate database using drop-down pick lists. If necessary, text fields could be parsed using Microsoft Excel formulae - e.g. to split data into two fields, to extract numeric data, or to convert weights from milligrams to grams. After pre-processing, the tab-delimited text files were loaded into a Microsoft Excel spreadsheet for finessing and final manual quality checks. Finally, the files were loaded into the corporate database - the SD File into a molecule database, and the text-delimited data file into the relational database, with the compound identifiers being used to match up records. OutcomePreviously, lots of manual on-off processing has been required for each SD File. This meant that the turn-round was slow and there were delays in availability of data and/or compounds. With the new system, although manual effort was not totally eliminated, much less was required, and the process was more predictable. Consequently the turn-round was rapid and service levels were much improved. |
||||||||||||