WebContentment
Web Usability & Data Content Services
The Trading Company of Peter Millington
Major Chemical Data Cleansing Project

The Problem
Following pharmaceutical company mergers, a local chemical database needed to be migrated into a new corporate database. The standards of the old database differed from those of the new corporate database, especially in the representation of salted compounds and stereoisomers. In addition, increased compound acquisitions for high throughput screening and combinatorial chemistry had overloaded the quality assurance system, eroding data quality. These issues needed to be addressed before the data could be converted for the new database.

Project Aims

Initial Assessment
Time was spent exploring the data to identify and quantify the problems. The findings were:
Chemical Structure Issues
Non-Structural Fields
Errors and Discrepancies
Fixability
All in all, over 20,000 compounds were identified that required review.

Solutions
First of all, local standards were reviewed in association with chemists and the developers of the new corporate database. This resulted in a new registration guide for chemists, which corrected some misconceptions.

Strategies
Next, a quality plan was prepared. The problem categories were listed in priority order. In general, the easiest problems came first, with the remainder in order of increasing complexity. However, some categories were promoted in the list, for instance if information would be lost during conversion. The strategy was to work as far through the list as possible during a pre-set period of two months. It was decided that difficult problems would not be checked against archive records, and it was therefore accepted that some problems would not be resolved prior to migration. A logging system was also set up to record which compounds had and had not been reviewed or fixed.

Methods & Tools
Text fields were generally standardised using Microsoft Excel. Unique values were downloaded into a spreadsheet, where standardised values were manually entered against the originals. Macros were then used to generate appropriate update commands for records in the main database.
Structures and associated notes were handled using Accord for Excel, which allows structures to be viewed alongside normal text and numeric data. Notes were processed in a similar way to the other text fields. For structures, columns were added indicating the fixes they required, and these were then used to prepare lists of manual drawing changes. A PhD chemist was engaged on contract from a chemical database publisher to assist with assessment and drawing for the duration of the project.
To help assess stereochemical configurations, a simple Visual Basic tool was written using components from MDL's Chemical Business Rules Manager software.
A structure could be cut and pasted into the tool, which then identified and colour-coded the types of stereocentres and geometrical bonds.

QA for New Compounds
Lastly, arrangements for on-going quality assurance were revamped to prevent errors at source. It was possible to retro-fit some additional automatic error checking to the registration application. However, the new registration guide was the key factor. This was distributed to all the chemists and endorsed by their managers.
New compounds were monitored daily for errors, which were queried immediately with the registering chemist. In part, this was an educational exercise. Special web pages were written about recurring errors, each explaining the problem and its implications and describing how to resolve it. Appropriate copies of these pages were passed to chemists when querying errors. Consequently, the number of queries was reduced.

Outcome
At the end of the project, over 95% of the 20,000 compounds that needed review had been processed. This improved the quality of the data sufficiently for it to be migrated to the new corporate database. Together, the new registration guide and the revamped quality assurance procedures reduced the number of on-going errors from two per day to fewer than one per week.

Further Information
Stereochemical Problems >> (Powerpoint 164kB)
Problem Compound Page >>
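
Illustrative sketch
The text-field step described under Methods & Tools (unique values paired with manually entered standard values, then macros generating update commands) can be sketched roughly as follows. This is not the project's actual macro code, which was written in Excel. It is a minimal Python illustration that assumes the update commands were SQL UPDATE statements. The table name, column name and sample values are hypothetical.

```python
def sql_quote(value: str) -> str:
    """Quote a string as an SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_updates(table: str, column: str, mapping: dict[str, str]) -> list[str]:
    """Return one UPDATE statement per original value that actually changes."""
    statements = []
    for original, standard in mapping.items():
        if original == standard:
            continue  # already in standard form, nothing to update
        statements.append(
            f"UPDATE {table} SET {column} = {sql_quote(standard)} "
            f"WHERE {column} = {sql_quote(original)};"
        )
    return statements


# Hypothetical table, column and values, for illustration only.
# Keys are the unique values found in the database; values are the
# standardised forms a reviewer entered against them.
mapping = {
    "Sigma-Aldrich": "Sigma-Aldrich",
    "sigma aldrich": "Sigma-Aldrich",
    "Aldrich's": "Sigma-Aldrich",
}
updates = build_updates("compounds", "supplier", mapping)
for statement in updates:
    print(statement)
```

Generating statements as text mirrors the spreadsheet approach, where each row yields one command that can be reviewed before it is run. If the updates were executed directly from a program instead, parameterised queries would be the safer choice.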