It has long been recognized that reports of gene mutations in cancer specimens can be contaminated by sequencing artefacts, for multiple reasons. Several reports of TP53 mutations are notoriously suspicious.

In 2005, we first reported a statistical analysis of the TP53 database using the activity of each TP53 mutant as a criterion to distinguish suspicious reports.
Soussi, T., Kato, S., Levy, P.P. & Ishioka, C. (2005). Reassessment of the TP53 mutation database in human disease by data mining with a library of TP53 missense mutations. Hum Mutat 25: 6-17


Activity of mutant TP53 according to their frequency in the database. TP53 mutants were classified into eight categories according to their occurrence in the database (green). Box-and-whisker plots show the upper and lower quartiles and range (box), median value (horizontal line inside the box), and full-range distribution (whisker line). The y-axis corresponds to TP53 transactivation activity; black triangles correspond to 0% (bottom) and 100% (top). Analysis was performed for missense mutants found in tumors, excluding cell lines and germline mutations. P-values listed above each bar refer to the comparison with the 160+ category. The Mann–Whitney U-test was used to evaluate statistical significance. NS, not significant; ***, P<0.0001. WAF1 activity corresponds to the raw data reported by Kato et al., included in the UMD TP53 database.
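The comparison described in the legend above can be sketched as follows. This is a minimal illustration only, not the code used for the published analysis: the function names, the two-sided normal approximation with tie correction, and the activity values are all assumptions made for the example.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for x, two-sided P) using the normal approximation.

    Assumption: tie-corrected variance, no continuity correction.
    Not defined when every value in x and y is identical (zero variance).
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    z = (u1 - n1 * n2 / 2) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u1, p

# Hypothetical transactivation activities (% of wild type), invented for
# illustration: a rarely reported category versus the 160+ category.
rare_category = [85.0, 92.0, 60.0, 75.0, 98.0, 70.0]
frequent_category = [5.0, 12.0, 8.0, 20.0, 3.0, 15.0]
u_stat, p_value = mann_whitney_u(rare_category, frequent_category)
print(f"U = {u_stat}, P = {p_value:.4f}")
```

In practice a library routine such as scipy.stats.mannwhitneyu would normally be preferred, in particular for small samples where an exact P-value is more appropriate than the normal approximation.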


Since 2005, the UMD TP53 database has been curated and has provided the scientific community with specific tools to assess the confidence of each TP53 mutant.

In 2012, we performed a novel statistical analysis of the latest release of the TP53 mutation database using an original multi-criteria strategy. The use of multiple independent criteria allowed a robust analysis and led to a marked improvement in the quality of the UMD TP53 mutation database.
Edlund K, Larsson O, Ameur A, Bunikis I, Gyllensten U, Leroy B, Sundstrom M, Micke P, Botling J, Soussi T (2012) Data-driven unbiased curation of the TP53 tumor suppressor gene mutation database and validation by ultradeep sequencing of human tumors. Proc Natl Acad Sci U S A 109: 9551-9556

Principal component analysis (PCA) was used to evaluate all of these criteria in a combined analysis. The first four components captured 66% of the total variance and were therefore used to calculate the number of standard deviations (SD) by which each study deviated from the median. We identified 129 studies (9.7%) that deviated from the median by >2 SD. This SD value has been included in the database so that each user can work with their own dataset.
The curated version of the database does not include these 129 outlier studies.
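
The outlier screen described above can be sketched as follows, using NumPy. This is a hedged illustration, not the published procedure: the source does not state exactly how the four component scores were combined into a single SD value, so the sketch assumes the Euclidean distance of each study from the component-wise median, expressed in units of the SD of those distances. The function name `flag_outlier_studies`, the standardization step, and the simulated criteria matrix are likewise assumptions.

```python
import numpy as np

def flag_outlier_studies(criteria, n_components=4, threshold=2.0):
    """Score studies by their deviation from the median in PCA space.

    criteria: array of shape (n_studies, n_criteria), one row per study.
    Returns (sd_units, is_outlier, explained_fraction).
    """
    X = np.asarray(criteria, dtype=float)
    # Standardize each criterion so that no single one dominates the PCA.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # PCA via singular value decomposition of the standardized matrix.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = float(np.sum(S[:n_components] ** 2) / np.sum(S ** 2))
    # Project every study onto the first n_components components.
    scores = Z @ Vt[:n_components].T
    # Assumed combination rule: distance from the component-wise median,
    # divided by the SD of the distances across all studies.
    dist = np.linalg.norm(scores - np.median(scores, axis=0), axis=1)
    sd_units = dist / dist.std(ddof=1)
    return sd_units, sd_units > threshold, explained

# Simulated data, invented for illustration: 60 studies, 6 criteria,
# with the first study made deliberately aberrant.
rng = np.random.default_rng(0)
demo = rng.normal(size=(60, 6))
demo[0] = 10.0
sd_units, is_outlier, explained = flag_outlier_studies(demo)
print(f"variance captured by 4 components: {explained:.0%}")
print("flagged study indices:", np.flatnonzero(is_outlier))
```

Returning the continuous SD value alongside the flag mirrors the choice made in the database: a user who prefers a stricter or looser cutoff than 2 SD can apply it without repeating the analysis.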