Guan-Hua Huang, Ph.D.
Institute of Statistics
National Yang Ming Chiao Tung University
1001 University Road
Hsinchu 30010, TAIWAN

Tel: +886-3-513-1334
Fax: +886-3-572-8745
ghuang@nycu.edu.tw
Office: 423 Assembly Building 1


last updated February 10, 2023

Deep learning medical image analysis

Medical imaging plays an essential role in the detection and diagnosis of numerous diseases. There has been a wide variety of research on computer-aided diagnosis of medical images aimed at improving diagnostic efficiency while ensuring high accuracy. Medical images contain huge amounts of physiological information and are exactly what data-hungry deep learning paradigms need to build valuable intelligent auxiliary systems.

Through collaboration with E-Da Hospital, I-Shou University, Taiwan, we have access to single photon emission computed tomography (SPECT) imaging data. We have successfully applied advanced machine learning methods to improve the multi-class classification of SPECT images for Parkinson's disease stage prediction.

Chest X-ray (CXR) is widely used to diagnose conditions affecting the chest, its contents, and its nearby structures. We use deep convolutional neural network models to extract feature representations and to identify possible diseases in CXR images. We also use transfer learning combined with large open-source image data sets to address the problem of insufficient training data and to optimize the classification model.
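
As a rough illustration of this transfer learning approach, the sketch below fine-tunes an ImageNet-pretrained convolutional network on CXR images. The backbone choice (ResNet-50), the 14-label output, the frozen layers, and the data loader interface are assumptions made for the example, not the exact configuration used in our work.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Load an ImageNet-pretrained backbone and replace its classification head
    # (ResNet-50 and 14 disease labels are assumptions for this sketch).
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    num_labels = 14
    backbone.fc = nn.Linear(backbone.fc.in_features, num_labels)

    # Freeze the pretrained layers so that only the new head is trained at first.
    for name, param in backbone.named_parameters():
        if not name.startswith("fc"):
            param.requires_grad = False

    criterion = nn.BCEWithLogitsLoss()   # multi-label CXR findings
    optimizer = torch.optim.Adam(
        (p for p in backbone.parameters() if p.requires_grad), lr=1e-4)

    def train_one_epoch(loader, device="cuda"):
        """loader yields (images, labels): (N, 3, 224, 224) tensors and 0/1 label vectors."""
        backbone.to(device).train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device).float()
            optimizer.zero_grad()
            loss = criterion(backbone(images), labels)
            loss.backward()
            optimizer.step()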


Latent class modeling

My primary statistical methodology research focuses on the development of statistical methods for problems in which the process of interest is unobservable. In many medical studies, the definitive outcome is inaccessible, and a valid surrogate endpoint is measured in place of the clinically most meaningful endpoint. I have developed a latent variable model for analyzing this kind of data structure. The model summarizes the unobservable definitive outcome as an underlying categorical variable and incorporates covariate effects on both the underlying and the measured variables. Importantly, the model framework guarantees identifiability of the two types of covariate effects.
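
In generic notation, the structure of such a latent class regression model can be sketched as below; the link functions and covariate partition here are illustrative rather than the exact specification in my papers. Latent class membership depends on covariates x_i through a multinomial logit, and the J measured surrogates y_{i1}, ..., y_{iJ} are conditionally independent given the latent class S_i (and possibly further covariates z_i):

    \[
      P(S_i = c \mid x_i)
        = \frac{\exp(\alpha_c + \beta_c^{\top} x_i)}
               {\sum_{k=1}^{C} \exp(\alpha_k + \beta_k^{\top} x_i)},
      \qquad c = 1, \dots, C,
    \]
    \[
      P(y_{i1}, \dots, y_{iJ} \mid S_i = c, z_i)
        = \prod_{j=1}^{J} P(y_{ij} \mid S_i = c, z_i).
    \]

The two sets of covariate effects enter through the class-membership probabilities and the conditional response probabilities, respectively; identifiability requires restrictions on how x_i and z_i enter the two parts of the model.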

I also provide theory and practical methods for selecting the number of underlying variable categories. The proposed approach is based on an analogous method used in factor analysis and does not require repeated model fitting under different numbers of categories.

Statisticians typically estimate the parameters of latent variable models using the Expectation-Maximization algorithm. I propose an alternative two-stage optimization-based approach to model fitting. The proposed approach is theoretically justifiable, directly checks the conditional independence assumption, and converges much faster than the full likelihood approach when analyzing high-dimensional data.
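
For reference, a minimal sketch of the conventional EM fit for a latent class model with binary items is given below; it illustrates the full-likelihood approach mentioned above, not the proposed two-stage method, and the simple random initialization is an assumption of the example.

    import numpy as np

    def latent_class_em(Y, C, n_iter=200, seed=0):
        """EM for a latent class model with binary items.
        Y: (n, J) matrix of 0/1 responses; C: number of latent classes."""
        rng = np.random.default_rng(seed)
        n, J = Y.shape
        pi = np.full(C, 1.0 / C)                       # class prevalences
        rho = rng.uniform(0.25, 0.75, size=(C, J))     # item response probabilities
        for _ in range(n_iter):
            # E-step: posterior class membership probabilities for each subject.
            logp = np.log(pi) + Y @ np.log(rho).T + (1 - Y) @ np.log(1 - rho).T
            logp -= logp.max(axis=1, keepdims=True)
            post = np.exp(logp)
            post /= post.sum(axis=1, keepdims=True)    # (n, C)
            # M-step: update prevalences and item response probabilities.
            pi = post.mean(axis=0)
            rho = (post.T @ Y) / post.sum(axis=0)[:, None]
            rho = np.clip(rho, 1e-6, 1 - 1e-6)
        return pi, rho, post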

I also propose a Bayesian framework to perform the joint estimation of the number of latent classes and model parameters. The proposed approach applies the reversible jump Markov chain Monte Carlo to analyze finite mixtures of multivariate multinomial distributions. We also develop a procedure for the unique labelling of the classes.


Genetic analysis

I am also working on genetic analysis studies. The first study is on endophenotype validation. Endophenotypes, which involve the same biological pathways as diseases but presumably are closer to the relevant gene action than diagnostic phenotypes, have emerged as an important concept in genetic studies of complex diseases. In this work, we develop a formal statistical methodology for validating endophenotypes. We also propose an index to be used as an operational criterion of validation.

The second study concerns the analysis of gene expression microarray data. Through the analysis of spike-in, RT-PCR, and cross-laboratory benchmark datasets, we evaluate combinations of the most popular preprocessing and differential expression detection methods in terms of accuracy and inter-laboratory consistency. Our results provide general guidelines for selecting preprocessing and differential expression methods when analyzing Affymetrix GeneChip array data.
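
As a generic illustration of the differential expression step only (not one of the specific pipelines benchmarked in that work), a per-gene two-sample t-test with false discovery rate control might look like this:

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    def detect_de_genes(expr, is_treatment, alpha=0.05):
        """expr: (genes, samples) matrix of preprocessed log2 expression values.
        is_treatment: boolean array marking the samples in the treatment condition."""
        t_stat, p_val = stats.ttest_ind(expr[:, is_treatment], expr[:, ~is_treatment], axis=1)
        reject, p_adj, _, _ = multipletests(p_val, alpha=alpha, method="fdr_bh")
        return np.flatnonzero(reject), p_adj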

The third study is on genotype imputation accuracy. Many researchers use the genotype imputation approach to predict the genotypes at rare variants that are not directly genotyped in the study sample. One important question in genotype imputation is how to choose a reference panel that will produce high imputation accuracy in a population of interest. Using whole genome sequence data from the Genetic Analysis Workshop 18 data set, this report compares genotype imputation accuracy among reference panels representing different degrees of genetic similarity to a study sample of admixed Mexican Americans.
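
In such comparisons, imputation accuracy is often summarized by the squared correlation between imputed allele dosages and the true (masked) genotypes. A minimal sketch, assuming both quantities are available as per-variant arrays:

    import numpy as np

    def dosage_r2(true_genotypes, imputed_dosages):
        """Per-variant squared correlation between true genotype counts (0/1/2)
        and imputed allele dosages; rows are variants, columns are samples."""
        r2 = np.full(true_genotypes.shape[0], np.nan)
        for i, (g, d) in enumerate(zip(true_genotypes, imputed_dosages)):
            if np.std(g) > 0 and np.std(d) > 0:
                r2[i] = np.corrcoef(g, d)[0, 1] ** 2
        return r2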

The fourth study uses a Bayesian formulation of a clustering procedure to identify gene-gene interactions in case-control studies, called the Algorithm via Bayesian Clustering to Detect Epistasis (ABCDE). The ABCDE uses Dirichlet process mixtures to model SNP marker partitions and uses the Gibbs weighted Chinese restaurant sampling to simulate posterior distributions of these partitions. This study also develops permutation tests to validate the disease association of SNP subsets identified by the ABCDE, which can yield results that are more robust to model specification and prior assumptions.
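
The permutation test idea can be sketched as follows for one candidate SNP subset; the association statistic is left abstract here, since the statistic actually used depends on the ABCDE output rather than on this placeholder interface.

    import numpy as np

    def permutation_pvalue(stat_fn, genotypes, status, n_perm=10000, seed=1):
        """Permutation test for case-control association of a SNP subset.
        stat_fn(genotypes, status) returns an association statistic (larger = stronger);
        the null distribution is obtained by shuffling the case/control labels."""
        rng = np.random.default_rng(seed)
        observed = stat_fn(genotypes, status)
        null = np.array([stat_fn(genotypes, rng.permutation(status))
                         for _ in range(n_perm)])
        return (1 + np.sum(null >= observed)) / (1 + n_perm)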

The fifth study is on identifying copy number variations (CNVs), genomic structural mutations with abnormal gene fragment copies, in next-generation sequencing data. We develop a COpy Number variation detection tool by a BaYesian procedure, CONY, which adopts a hierarchical model and an efficient reversible jump Markov chain Monte Carlo inference algorithm for whole genome and exome sequencing data.


Collaboration in medical research

Through collaboration on a variety of projects, I have the opportunity to provide statistical expertise in support of health research and to motivate my methodological work.

The Beaver Dam Offspring Study is a longitudinal population-based cohort study that is designed to provide key data on the incidence and risk factors of hearing, vision and olfaction impairments among the post-World War II "baby-boom" generation. The principal investigator of the project is Dr. Karen J. Cruickshanks from the University of Wisconsin-Madison. I have worked on this project as a consultant to provide suggestions on statistical methods for analyzing the data and addressing the questions of interest.

I work with Dr. Cheryl Chia-Hui Chen from the Department of Nursing at the National Taiwan University on several projects. I am in charge of the statistical analysis of these projects. These studies aim to determine risk factors associated with cognitive, nutritional, and functional decline in older hospitalized patients.


Data science

Recently, I have been involved in many data science projects in which we adopt innovative statistical methods to solve real industrial problems.

As technology advances and manufacturing systems become automated, machine equipment is becoming more sophisticated and expensive. Keeping this equipment operating normally and eliminating failures quickly has become an important issue in production management. To achieve this goal, modern machine equipment is often fitted with many sensing devices that monitor machine conditions at all times to ensure that the machine can operate normally over the long term. In collaboration with the Industrial Technology Research Institute in Taiwan, we apply innovative statistical methods to analyze sensing data obtained from the MOCVD machines that produce LED chips. We develop prognostics and health management (PHM) technology to improve the production capacity of these machines, including process parameter extraction for device components, abnormal process detection, and remaining useful life (RUL) prediction.
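
As a deliberately simplified illustration of the RUL component only (not the technology actually deployed), one can extract a health indicator from the sensor signals, fit a degradation trend, and extrapolate it to a failure threshold; the linear trend and the threshold below are assumptions of the sketch.

    import numpy as np

    def estimate_rul(times, health_indicator, failure_threshold):
        """Fit a linear degradation trend to a health indicator derived from
        the sensing data and extrapolate to the failure threshold.
        Returns the estimated remaining useful life after the last observation."""
        slope, intercept = np.polyfit(times, health_indicator, deg=1)
        if slope <= 0:                       # no upward degradation trend detected
            return np.inf
        t_fail = (failure_threshold - intercept) / slope
        return max(t_fail - times[-1], 0.0)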

Highway traffic congestion is a growing problem. This study uses statistical methods to construct five traffic flow prediction models, one each for long holidays, weekends, Mondays/Fridays, Tuesdays through Thursdays, and single holidays. The resulting traffic estimates can guide people to avoid traffic jams and also serve the traffic control center as a reference for real-time traffic control strategies. Freeway No. 5 in Taiwan experiences peak congestion on holidays, weekends, and weekdays and has a high density of detectors, so it provides abundant traffic information for analysis. This study therefore uses Freeway No. 5 as the study scope to develop and validate the proposed traffic flow prediction models.
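
A minimal sketch of the day-type-specific prediction idea is given below, using a historical-profile baseline; the column names, the five day-type labels, and the 5-minute flow interval are assumptions for illustration, not the models actually developed.

    import pandas as pd

    # df is assumed to have columns "timestamp", "flow" (vehicles per 5 minutes), and
    # "day_type" in {"long_holiday", "weekend", "mon_fri", "tue_thu", "single_holiday"}.
    def build_profiles(df):
        """One predictive profile per day type: mean flow at each time of day."""
        df = df.assign(time_of_day=pd.to_datetime(df["timestamp"]).dt.time)
        return df.groupby(["day_type", "time_of_day"])["flow"].mean()

    def predict_flow(profiles, day_type, time_of_day):
        """Look up the expected flow for a given day type and time of day."""
        return profiles.loc[(day_type, time_of_day)]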

Human-exhaled volatile organic compounds (VOCs) can be altered by lung cancer and become identifiable biomarkers. We used selected ion flow tube mass spectrometry to quantitatively analyze 116 kinds of VOCs in breath samples from 148 lung cancer patients and 168 healthy individuals. We used eXtreme Gradient Boosting (XGBoost), a machine learning method, to build a model for predicting the occurrence of lung cancer based on quantitative VOC measurements. The proposed prediction model achieved better performance than previous approaches, with an accuracy, sensitivity, specificity, and area under the curve (AUC) of 0.89, 0.82, 0.94, and 0.95, respectively.
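
A hedged sketch of the model-building step is given below, assuming X is the matrix of the 116 quantitative VOC measurements and y indicates lung cancer status; the train/test split and the hyperparameters are placeholders, not the tuned settings from the study.

    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

    # X: (n_samples, 116) VOC concentrations; y: 1 = lung cancer, 0 = healthy.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                          eval_metric="logloss")
    model.fit(X_train, y_train)

    prob = model.predict_proba(X_test)[:, 1]
    pred = (prob >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print("accuracy   :", accuracy_score(y_test, pred))
    print("sensitivity:", tp / (tp + fn))
    print("specificity:", tn / (tn + fp))
    print("AUC        :", roc_auc_score(y_test, prob))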


Joint analysis of transition probabilities

I have also worked on a study analyzing age-related maculopathy (ARM), a leading cause of vision loss in people aged 65 or older. ARM is distinctive in that it is a disease which can progress, regress, disappear, and recur. I develop a transition model for jointly studying the relationship of incidence, progression, regression, and disappearance probabilities with risk factors. The developed method can be widely applied to other diseases with similar transitional characteristics.
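
In generic notation, the joint transition structure can be written as state-specific multinomial logit models; the state coding and parameterization below are illustrative, not the exact formulation developed in the paper. With S_{i,t} denoting the ARM state of subject i at visit t and x_i the risk factors,

    \[
      P(S_{i,t+1} = s \mid S_{i,t} = r, x_i)
        = \frac{\exp(\gamma_{rs} + \delta_{rs}^{\top} x_i)}
               {\sum_{k} \exp(\gamma_{rk} + \delta_{rk}^{\top} x_i)},
    \]

so that incidence, progression, regression, and disappearance correspond to particular (r, s) pairs and are modeled jointly through one set of equations.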


Multiple ordinal measurements

Analysis of "multiply-measured" ordinal outcomes is another research topic. Co-authors and I have detailed challenges and strategies for analyzing such data. We also apply generalized estimating equations (GEE) methodology to the analysis of multiple ordinal measurements and develop graphical diagnostic displays to evaluate the adequacy of the models.
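
A minimal sketch of a GEE fit for repeated ordinal outcomes, using statsmodels; the variable names, the covariates, and the global odds ratio working dependence structure are placeholders for illustration.

    import statsmodels.api as sm

    # df is assumed to contain an ordinal outcome "severity", covariates "age" and
    # "treatment", and a cluster identifier "subject_id" for the repeated measurements.
    cov_struct = sm.cov_struct.GlobalOddsRatio("ordinal")
    model = sm.OrdinalGEE.from_formula(
        "severity ~ age + treatment", groups="subject_id",
        data=df, cov_struct=cov_struct)
    result = model.fit()
    print(result.summary())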
