WebContentment
Web Usability & Data Content Services
The Trading Company of Peter Millington

Major Chemical Data Cleansing Project


The Problem

Following pharmaceutical company mergers, a local chemical database needed to be migrated into a new corporate database. The standards of the old database differed from those of the new corporate database, especially for the representation of salted compounds and stereoisomers. In addition, increased compound acquisitions for high throughput screening and combinatorial chemistry had overloaded the quality assurance system, eroding data quality. These issues needed to be addressed before the data could be converted for the new database.

Project Aims

  • To review local standards to ensure compatibility with emerging corporate conventions and new technology.
  • To modify records to match any revised standards, if necessary and practicable.
  • To correct errors, discrepancies, etc.
  • To eliminate variability - or at least get data into a more consistent state.
  • To maintain quality during parallel running, prior to final migration.

Initial Assessment

Time was spent exploring the data to identify and quantify the problems. The main findings were:

Chemical Structure Issues

  • Ambiguous chemical representation - especially stereochemical and geometric isomers.
  • Some specific technological issues, where information could be lost during conversion. These mostly related to special chemical bonds - stereo, pi, dative, etc. - affecting stereoisomers, and complexes such as ferrocenes.

Non-Structural Fields

  • Many fields were free-text, with little automatic error checking.
  • Many names and terms needed to be standardised - e.g. the "University of East Anglia" had been entered in numerous ways - "UEA", "U.E.A", "Uni East Anglia", etc.
  • Physical properties often fitted patterns that could be parsed to extract numerical values, solvents, units, etc.
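
As an illustration of the kind of pattern-based parsing mentioned above, the following is a minimal Python sketch. The entry format ("mp 120-122 C (EtOH)"), the `MeltingPoint` type and the function name are assumptions made for this example; the actual field formats in the database are not recorded here.

```python
import re
from typing import NamedTuple, Optional


class MeltingPoint(NamedTuple):
    low: float               # lower bound of the range, in the stated unit
    high: float              # upper bound (equal to low for a single value)
    unit: str                # "C" or "K"
    solvent: Optional[str]   # solvent given in brackets, if any


# Hypothetical free-text format: "mp 120-122 C (EtOH)" or "m.p. 85 C".
MP_PATTERN = re.compile(
    r"""^\s*m\.?p\.?\s*                        # leading "mp" or "m.p."
        (?P<low>\d+(?:\.\d+)?)                 # first number
        (?:\s*-\s*(?P<high>\d+(?:\.\d+)?))?    # optional upper bound
        \s*°?\s*(?P<unit>[CK])                 # unit, optional degree sign
        (?:\s*\((?P<solvent>[^)]+)\))?         # optional solvent in brackets
        \s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_melting_point(text: str) -> Optional[MeltingPoint]:
    """Return the parsed value, or None if the text does not fit the pattern."""
    match = MP_PATTERN.match(text)
    if match is None:
        return None  # leave non-conforming entries for manual review
    low = float(match.group("low"))
    high = float(match.group("high") or low)
    solvent = match.group("solvent")
    return MeltingPoint(
        low,
        high,
        match.group("unit").upper(),
        solvent.strip() if solvent else None,
    )
```

Entries that do not fit the pattern return `None`, so they can be set aside for manual inspection rather than guessed at.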

Errors and Discrepancies

  • Information entered in the wrong field.
  • Discrepancies between drawn chemical structures and corresponding textual notes.

Fixability

  • Most inconsistencies were easy to fix, and often a single update command could be used for a large block of records.
  • Chemical structure changes took longer because they needed to be drawn manually.
  • In the worst cases, checking information against laboratory notebooks and other records in the company archives could take up to 30 minutes per compound.
  • Clearly quality assurance could be a never-ending task.

All in all, over 20,000 compounds were identified that required review.

Solutions

First of all, local standards were reviewed in association with chemists and the developers of the new corporate database. This resulted in a new registration guide for chemists, which corrected some misconceptions.

Strategies

Next, a quality plan was prepared. A list of the various problem categories was made and put in prioritised order. In general, the easiest problems were listed first, with the remainder in order of increasing complexity. However, some categories were promoted in the list - for instance if information would be lost during conversion.
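
The ordering rule can be sketched in a few lines of Python. The category names and complexity scores below are invented for illustration, and the sketch assumes that "promoted" categories simply move to the front of the list, which may be cruder than the judgement actually applied.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProblemCategory:
    name: str
    complexity: int                    # 1 = easiest to fix; larger = harder
    lost_on_conversion: bool = False   # would information be lost on migration?


def prioritise(categories: Iterable[ProblemCategory]) -> List[ProblemCategory]:
    """Promoted categories first, then easiest-first within each group."""
    # False sorts before True, so "not lost_on_conversion" puts the
    # promoted categories at the front of the list.
    return sorted(categories, key=lambda c: (not c.lost_on_conversion, c.complexity))


# Hypothetical categories and scores, for illustration only.
categories = [
    ProblemCategory("Ambiguous stereochemistry", 4),
    ProblemCategory("Non-standard institution names", 1),
    ProblemCategory("Special bonds lost on conversion", 3, lost_on_conversion=True),
    ProblemCategory("Information in the wrong field", 2),
]
plan = prioritise(categories)
```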

The strategy was to work as far through the list as possible during a pre-set time of two months. It was decided that difficult problems would not be checked against archive records, and therefore it was accepted that some problems would not be resolved prior to migration. A logging system was also set up to record which compounds had been reviewed or fixed and which had not.
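
A log of this kind can be very simple. The sketch below is one possible shape in Python, not a description of the system that was actually used; the status names, the `ReviewLog` class and the compound identifiers are all hypothetical.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    PENDING = "pending"     # not yet looked at
    REVIEWED = "reviewed"   # checked, no change needed
    FIXED = "fixed"         # checked and corrected


class ReviewLog:
    """Tracks the review status of each compound flagged for attention."""

    def __init__(self, compound_ids: Iterable[str]) -> None:
        self._status: Dict[str, Status] = {cid: Status.PENDING for cid in compound_ids}

    def record(self, compound_id: str, status: Status) -> None:
        if compound_id not in self._status:
            raise KeyError(f"{compound_id} is not on the review list")
        self._status[compound_id] = status

    def outstanding(self) -> List[str]:
        """Compounds that have not yet been reviewed or fixed."""
        return sorted(cid for cid, s in self._status.items() if s is Status.PENDING)

    def progress(self) -> float:
        """Fraction of flagged compounds that have been processed."""
        if not self._status:
            return 0.0
        done = sum(1 for s in self._status.values() if s is not Status.PENDING)
        return done / len(self._status)


# Hypothetical compound identifiers.
log = ReviewLog(["CPD-0001", "CPD-0002", "CPD-0003", "CPD-0004"])
log.record("CPD-0001", Status.FIXED)
log.record("CPD-0003", Status.REVIEWED)
```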

Methods & Tools

Text fields were generally standardised using Microsoft Excel. Unique values were downloaded into a spreadsheet where standardised values were manually entered against the originals. Macros were then used to generate appropriate update commands for records in the main database.
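
The same idea can be sketched in Python rather than Excel macros. This is only an illustration of the approach: the table name `compounds`, the column name `source_name` and the SQL dialect are assumptions, and the mapping stands in for the spreadsheet of original and standardised values.

```python
from typing import Dict, List


def sql_quote(value: str) -> str:
    """Quote a string literal for SQL by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_update_commands(
    mapping: Dict[str, str],
    table: str = "compounds",      # hypothetical table name
    column: str = "source_name",   # hypothetical column name
) -> List[str]:
    """One UPDATE per original value that differs from its standard form."""
    commands = []
    for original, standard in sorted(mapping.items()):
        if original == standard:
            continue  # already in standard form, nothing to update
        commands.append(
            f"UPDATE {table} SET {column} = {sql_quote(standard)} "
            f"WHERE {column} = {sql_quote(original)};"
        )
    return commands


# Original values mapped to the manually chosen standard value.
mapping = {
    "UEA": "University of East Anglia",
    "U.E.A": "University of East Anglia",
    "Uni East Anglia": "University of East Anglia",
    "University of East Anglia": "University of East Anglia",
}
commands = build_update_commands(mapping)
```

Each generated command updates every record holding a given variant in one step, which is why a single command could often fix a large block of records.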

Structures and associated notes were handled using Accord for Excel, which allows structures to be viewed alongside normal text and numeric data. Notes were processed in a similar way to the other text fields. For structures, columns were added indicating the fixes they required, which were then used to prepare lists of manual drawing changes. A PhD chemist was engaged on contract from a chemical database publisher to assist with assessment and drawing for the duration of the project.

To help assess stereochemical configurations, a simple Visual Basic tool was written using components from MDL's Chemical Business Rules Manager software. A structure could be cut and pasted into the tool, which then identified and colour-coded the types of stereocentres and geometrical bonds.

QA for New Compounds

Lastly, arrangements for on-going quality assurance were revamped to prevent errors at source. It was possible to retro-fit some additional automatic error checking to the registration application. However, the new registration guide was the key factor. This was distributed to all the chemists and endorsed by their managers.

New compounds were monitored daily for errors, which were queried immediately with the registering chemist. In part, this was an educational exercise. Special web pages were written concerning recurring errors, explaining the problem and its implications, and describing how to resolve it. Appropriate copies of these pages were passed to chemists when querying errors. Consequently, the number of queries was reduced.

Outcome

At the end of the project, over 95% of the 20,000 compounds that needed review had been processed. This improved the quality of the data sufficiently for it to be migrated to the new corporate database.

The new registration guide and the revamped quality assurance procedures together succeeded in reducing the number of on-going errors from two per day to fewer than one per week.

Further Information

  • Stereochemical Problems (Powerpoint, 164kB)
  • Problem Compound Page


© Copyright 2003, WebContentment, Last updated: 10-Jan-2006