As early as January, Google Health, the Google division focused on health-related research, clinical tools, and medical services, released an artificial intelligence model trained on more than 90,000 mammograms, claiming it achieved better results than human radiologists. According to Google, the algorithm catches more false negatives (images that look normal but actually contain breast cancer) than previous work. But some clinicians, data scientists, and engineers dispute this claim.
In a rebuttal published today in Nature, more than 19 co-authors at McGill University, the City University of New York (CUNY), Harvard University, and Stanford University argue that Google's research lacks the detailed methods and code needed to reproduce it, undermining its scientific value. Reproducibility is a problem across science: a 2016 survey of 1,500 scientists found that 70 percent had tried to replicate other scientists' experiments at least once and failed.
In the field of artificial intelligence, the problem is especially acute. At the 2019 ICML conference, 30% of authors had not submitted code with their papers by the time the conference started. Studies often report benchmark results instead of releasing source code, and problems arise when the thoroughness of those benchmarks is questioned. One recent report found that 60% to 70% of the answers given by natural language processing (NLP) models were embedded somewhere in the benchmark training sets, suggesting that the models often simply memorize answers.
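The kind of contamination that report describes can be illustrated with a crude check for whether benchmark answers already appear verbatim in the training text. This is a minimal, hypothetical sketch, not the methodology of the cited report; all names and the toy data are illustrative.

```python
# Hypothetical sketch: estimate how many benchmark answers a model could
# "solve" by memorization because they appear verbatim in its training corpus.

def contamination_rate(answers, training_corpus):
    """Fraction of benchmark answers found verbatim in the training text."""
    corpus = training_corpus.lower()
    hits = sum(1 for a in answers if a.lower() in corpus)
    return hits / len(answers)

# Toy example: two of the three benchmark answers leak from the training text.
training_corpus = "the eiffel tower is in paris. water boils at 100 celsius."
answers = ["Paris", "100 Celsius", "Mount Everest"]
print(contamination_rate(answers, training_corpus))
```

Real contamination audits use fuzzier matching (n-gram overlap, near-duplicate detection), but even a verbatim check like this can flag a benchmark whose answers sit inside the training set.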
The co-authors say Google's breast cancer study lacks details, including a description of model development and of the data-processing and training pipelines used. Google omitted the definitions of several hyperparameters of the model architecture and did not disclose the variables used to augment the training data, both of which can significantly affect performance. For example, the Nature co-authors claim, one of the data augmentations Google used could produce multiple instances of the same patient, skewing the final results.
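The leakage the co-authors describe arises when augmentation duplicates a patient and the train/test split is then done per image, so copies of the same patient land on both sides. A common guard is to split by patient ID rather than by image. The sketch below is a hypothetical illustration of that guard, not Google's pipeline; all field names and data are invented.

```python
# Hypothetical sketch: split image records by patient ID so that no patient
# (including augmented copies of their images) appears in both train and test.
import random

def split_by_patient(records, test_frac=0.3, seed=0):
    """Group-aware split: all images of a patient stay on one side."""
    patients = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test

# Augmented dataset: patient "p1" appears twice (original + flipped copy).
records = [
    {"patient_id": "p1", "image": "p1_orig.png"},
    {"patient_id": "p1", "image": "p1_flipped.png"},
    {"patient_id": "p2", "image": "p2_orig.png"},
    {"patient_id": "p3", "image": "p3_orig.png"},
]
train, test = split_by_patient(records)
overlap = {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
print(overlap)  # empty set: no patient leaks across the split
```

A naive per-image split of the same records could put `p1_orig.png` in training and `p1_flipped.png` in testing, inflating the measured accuracy exactly as the co-authors warn.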
Google says the code used to train the model has many dependencies on internal tools, infrastructure, and hardware, making its release infeasible. In deciding not to publish the two training datasets, the company also cited their proprietary nature and the sensitivity of patient health data. But the Nature co-authors point out that raw data sharing has become increasingly common in the biomedical literature, rising from under 1% in the early 2000s to 20% today, and that the model's predictions and data labels could have been released without disclosing personal information.
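One way to release predictions and labels without exposing patient data, roughly in the spirit of what the co-authors suggest, is to publish per-case model scores and ground-truth labels keyed by opaque identifiers rather than medical record numbers. This is a hypothetical sketch with invented field names, not a description of any actual release.

```python
# Hypothetical sketch: export model predictions + labels to CSV under salted,
# hashed case keys, with no raw images or identifying fields included.
import csv
import hashlib
import io

def deidentified_export(cases, salt="release-2020"):
    """Write predictions and labels keyed by a salted hash of the patient ID."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["case_key", "model_score", "label"])
    for c in cases:
        key = hashlib.sha256((salt + c["patient_id"]).encode()).hexdigest()[:12]
        writer.writerow([key, f"{c['score']:.3f}", c["label"]])
    return buf.getvalue()

cases = [
    {"patient_id": "MRN-001", "score": 0.91, "label": 1},
    {"patient_id": "MRN-002", "score": 0.12, "label": 0},
]
print(deidentified_export(cases))
```

A file like this lets outside researchers recompute ROC curves and error rates, which is much of what reproducibility requires, while the salt keeps the original identifiers from being trivially rederived.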