last updated February 10, 2023
Deep learning medical image analysis
Medical imaging plays an essential role in the detection and diagnosis of numerous diseases.
There has been a variety of research in computer-aided diagnosis of medical images to
improve diagnostic efficiency and ensure high accuracy.
The medical images containing huge amounts of physiological information are
exactly what the data-hungry deep learning paradigms need to build valuable intelligent auxiliary systems.
Through a collaboration with E-Da Hospital, I-Shou University, Taiwan,
we have access to single photon emission computed tomography (SPECT) imaging.
We have successfully applied
advanced machine learning methods to improve
the multi-class classification of SPECT images for Parkinson's disease stage prediction.
Chest X-ray (CXR) is widely used to diagnose conditions affecting the chest, its contents,
and its nearby structures.
We use deep convolutional neural network models to extract feature
representations and to identify possible diseases in CXR. We also use transfer learning
combined with large open-source image data sets to resolve the problems of insufficient training
data and optimize the classification model.
Latent class modeling
My primary statistical methodology research is focused on the development of statistical methods for problems in
which the process of interest is unobservable. In many medical studies, the definitive outcome is inaccessible,
and a valid surrogate endpoint is then measured in place of the clinically most meaningful endpoint.
I have developed
a latent variable model for analyzing this kind of data structure.
The model summarizes the unobservable definitive outcome as an underlying categorical variable and
incorporates covariate effects on both underlying and measured variables.
Significantly, I develop
a model framework that guarantees identifiability
of the two types of covariate effects.
I also provide theory and practical methods for
selecting the number of underlying variable categories.
The proposed approach is based on an analogous method used in factor analysis
and does not require repeated model fitting under different numbers of categories.
Statisticians typically estimate the parameters of latent variable models using the Expectation-Maximization algorithm.
I propose
an alternative two-stage optimization-based approach to model fitting.
The proposed approach is theoretically justifiable, directly checks the conditional independence assumption,
and converges much faster than the full likelihood approach when analyzing high-dimensional data.
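For reference, the standard Expectation-Maximization fit mentioned above can be sketched for the simplest latent class model: binary items that are conditionally independent given the class, with no covariates. This is only an illustration of the EM baseline, not the two-stage approach proposed here; the function name, the toy data, and the fixed iteration count are assumptions made for the example.

```python
import math
import random


def fit_latent_class_em(data, n_classes, n_iter=200, seed=0):
    """Fit a latent class model to binary items by EM.

    data: list of equal-length lists of 0/1 responses.
    Returns (class_weights, item_probs, log_likelihood), where the
    log-likelihood is evaluated in the E-step of the final iteration.
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    weights = [1.0 / n_classes] * n_classes
    # Random start for P(item = 1 | class), kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    log_lik = float("-inf")
    for _ in range(n_iter):
        # E-step: posterior class membership for each subject,
        # assuming items are independent given the class.
        post = []
        log_lik = 0.0
        for row in data:
            joint = []
            for c in range(n_classes):
                p = weights[c]
                for j, y in enumerate(row):
                    p *= probs[c][j] if y == 1 else 1.0 - probs[c][j]
                joint.append(p)
            total = sum(joint)
            log_lik += math.log(total)
            post.append([p / total for p in joint])
        # M-step: update class weights and item probabilities.
        for c in range(n_classes):
            mass = sum(r[c] for r in post)
            weights[c] = mass / len(data)
            for j in range(n_items):
                hits = sum(r[c] * row[j] for r, row in zip(post, data))
                # Clamp so that no probability reaches exactly 0 or 1.
                probs[c][j] = min(max(hits / mass, 1e-6), 1.0 - 1e-6)
    return weights, probs, log_lik


# Hypothetical toy responses: two well-separated response patterns.
data = ([[1, 1, 1, 1]] * 30 + [[0, 0, 0, 0]] * 30
        + [[1, 1, 1, 0]] * 5 + [[0, 0, 0, 1]] * 5)
weights, probs, log_lik = fit_latent_class_em(data, n_classes=2)
```

Each iteration touches every subject, item, and class, which is one reason a full-likelihood fit can become slow as the number of measured variables grows.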
I also propose
a Bayesian framework to perform the joint estimation of the number of latent classes and model parameters.
The proposed approach applies reversible jump Markov chain Monte Carlo to analyze finite mixtures of
multivariate multinomial distributions.
We also develop
a procedure for the unique labelling of the classes.
Genetic analysis
I am also working on genetic analysis studies.
The first study is on endophenotype validation.
Endophenotypes, which involve the same biological pathways as diseases but presumably
are closer to the relevant gene action than diagnostic phenotypes, have emerged as an
important concept in the genetic studies of complex diseases.
In this paper, we develop
a formal statistical methodology for validating endophenotypes.
We also propose an index to be used as operational criteria of validation.
The second study is for the analysis of gene expression microarray data.
Through the analysis of spike-in, RT-PCR and cross-laboratory benchmark datasets,
we
evaluate combinations of the most popular preprocessing and
differential expression detection methods in terms of accuracy and inter-laboratory consistency.
Our results provide general guidelines for selecting preprocessing and differential expression methods
in analyzing Affymetrix GeneChip array data.
The third study is on genotype imputation accuracy.
Many researchers use the genotype imputation approach to predict the genotypes at rare variants that
are not directly genotyped in the study sample.
One important question in genotype imputation is how to choose a reference panel that will produce
high imputation accuracy in a population of interest.
Using whole genome sequence data from the Genetic Analysis Workshop 18 data set,
this report
compares genotype imputation accuracy among reference panels
representing different degrees of genetic similarity
to a study sample of admixed Mexican Americans.
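One simple way to compare reference panels is genotype concordance: the fraction of imputed genotype calls that match the sequenced genotypes. The sketch below assumes genotypes coded as 0, 1, or 2 copies of the alternate allele; the panel names and values are hypothetical toy data, and the report itself may use a different accuracy measure.

```python
def imputation_concordance(true_genotypes, imputed_genotypes):
    """Fraction of genotype calls (0, 1, or 2 alternate alleles) that match."""
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype vectors must have the same length")
    if not true_genotypes:
        raise ValueError("no genotypes to compare")
    matches = sum(t == g for t, g in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


# Hypothetical toy data: sequenced genotypes and calls from two panels.
truth = [0, 1, 2, 0, 1, 0, 0, 2]
panel_calls = {
    "panel_A": [0, 1, 2, 0, 1, 0, 1, 2],
    "panel_B": [0, 0, 2, 0, 1, 1, 1, 2],
}
accuracy_by_panel = {
    name: imputation_concordance(truth, calls)
    for name, calls in panel_calls.items()
}
# Pick the panel with the highest concordance on this toy example.
best_panel = max(accuracy_by_panel, key=accuracy_by_panel.get)
```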
The fourth study uses
a Bayesian formulation of a clustering procedure to
identify gene-gene interactions in case-control studies,
called the Algorithm via Bayesian Clustering to Detect Epistasis (ABCDE).
The ABCDE uses Dirichlet process mixtures to model SNP marker partitions, and uses the Gibbs weighted
Chinese restaurant sampling to simulate posterior distributions of these partitions.
This study also develops permutation
tests to validate the disease association for SNP subsets identified by the ABCDE, which can yield results that
are more robust to model specification and prior assumptions.
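The Chinese restaurant process is the prior over partitions that a Dirichlet process mixture induces. The sketch below draws one random partition from that prior only: each item joins an existing group with probability proportional to the group's size, or starts a new group with probability proportional to a concentration parameter. It is not the Gibbs weighted Chinese restaurant sampler used by the ABCDE, which additionally weights each choice by the data likelihood; the function and parameter names are assumptions for illustration.

```python
import random


def sample_crp_partition(n_items, concentration, rng):
    """Draw one partition of items 0..n_items-1 from a Chinese restaurant process.

    concentration must be positive; larger values favor more groups.
    """
    tables = []  # each table is a list of item indices
    for item in range(n_items):
        # Existing table k is chosen with weight len(tables[k]);
        # a new table is opened with weight `concentration`.
        weights = [len(t) for t in tables] + [concentration]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(tables):
            tables.append([item])
        else:
            tables[choice].append(item)
    return tables


# Hypothetical usage: partition 10 SNP markers into groups.
rng = random.Random(1)
partition = sample_crp_partition(10, concentration=1.0, rng=rng)
```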
The fifth study is on
identifying copy number variations (CNVs),
genomic structural mutations with abnormal gene fragment copies, in next-generation sequencing data.
We develop a COpy Number variation detection tool by a BaYesian procedure,
CONY, which adopts a hierarchical model and an efficient
reversible jump Markov chain Monte Carlo inference algorithm
for whole genome and exome sequencing data.
Collaboration in medical research
Through collaboration on a variety of projects,
I have the opportunity to contribute my statistical expertise to health research and
to find motivation for my methodological work.
The Beaver Dam Offspring Study
is a longitudinal population-based cohort study
that is designed to provide key data on the incidence and risk factors of hearing,
vision and olfaction impairments among the post-World War II "baby-boom" generation.
The principal investigator of the project is Dr. Karen J. Cruickshanks from the University of Wisconsin-Madison.
I have worked on this project as a consultant to provide suggestions on statistical methods for
analyzing the data and addressing the questions of interest.
I work with Dr. Cheryl Chia-Hui Chen from the Department of Nursing at the National Taiwan University
on several projects. I am in charge of the statistical analysis of these projects.
These studies aim to determine risk factors associated with cognitive, nutritional, and functional decline
in older hospitalized patients.
Data science
Recently, I have been involved in many data science projects, where we adopt
some innovative statistical methods in solving real industrial problems.
As technology advances and manufacturing systems become automated,
machine equipment is becoming more sophisticated and expensive.
Keeping this equipment operating normally and resolving failures quickly
has become an important issue in production management.
To achieve this goal, modern machine equipment is often fitted with many sensing devices
that monitor machine conditions continuously so that the machine can operate normally for a long time.
In collaboration with the Industrial Technology Research Institute in Taiwan,
we apply innovative statistical methods to analyze a set of sensing data obtained from the MOCVD machines
that produce LED chips.
We develop prognostics and health management (PHM) technology for improving the production capacity of
these machines, including:
process parameter extraction technology for device components,
abnormal process judgment, and
remaining useful life (RUL) prediction.
Highway traffic congestion is a growing problem.
This study uses statistical methods to construct five traffic flow prediction models,
one each for long holidays, weekends, Mondays and Fridays, Tuesdays through Thursdays, and single-day holidays.
The resulting traffic estimates can help drivers avoid traffic jams and give the traffic control
center a reference for real-time traffic control strategies.
Freeway No. 5 has heavy peak congestion on holidays, weekends, and weekdays, and its
dense detector layout provides a large amount of traffic data for analysis.
This study therefore uses Freeway No. 5 in Taiwan as its scope for developing and
validating the proposed traffic flow prediction models.
Human-exhaled volatile organic compounds (VOCs) can be altered by lung cancer and
become identifiable biomarkers.
We used selected ion flow tube mass spectrometry to quantitatively analyze 116 kinds of VOCs
in breath samples from 148 lung cancer patients and 168 healthy individuals.
We used the eXtreme Gradient Boosting (XGBoost),
a machine learning method, to build a model for predicting the occurrence of lung cancer based on
quantitative VOC measurements.
The proposed prediction model achieved better performance than previous approaches,
with an accuracy, sensitivity, specificity, and area under the curve (AUC) of 0.89, 0.82, 0.94, and 0.95,
respectively.
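For readers less familiar with these four measures, the sketch below shows how each is computed for a binary classifier. The labels and scores are a small hypothetical example, not the study's data, and the 0.5 decision threshold is an assumption.

```python
def binary_classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def area_under_curve(y_true, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 1 = lung cancer, 0 = healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5
metrics = binary_classification_metrics(y_true, y_pred)
auc = area_under_curve(y_true, scores)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, while the AUC summarizes performance across all thresholds.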
Joint analysis of transition probabilities
I have also worked on a study analyzing age-related maculopathy (ARM), a leading cause of vision loss in people aged 65 or older. ARM is distinctive in that it can progress, regress, disappear, and recur. I develop a transition model for jointly studying the relationship of incidence, progression, regression, and disappearance probabilities with risk factors. The developed method can be widely applied to other diseases with similar transitional characteristics.
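As a starting point for thinking about such data, the transition probabilities can be estimated empirically by counting moves between disease states at consecutive visits. The sketch below does only that and includes no risk factors, so it is not the joint transition model described above; the state names and visit histories are hypothetical.

```python
from collections import Counter, defaultdict


def empirical_transition_probs(histories):
    """Estimate P(next state | current state) from per-subject state sequences."""
    counts = defaultdict(Counter)
    for states in histories:
        # Pair each visit with the following visit for the same subject.
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


# Hypothetical visit histories for four subjects.
histories = [
    ["none", "none", "early"],   # incidence
    ["none", "early", "late"],   # incidence, then progression
    ["early", "none", "none"],   # disappearance
    ["early", "early", "late"],  # progression
]
transition_probs = empirical_transition_probs(histories)
```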
Multiple ordinal measurements
Analysis of "multiply-measured" ordinal outcomes is another research topic. My co-authors and I have detailed challenges and strategies for analyzing such data. We also apply generalized estimating equations methodology for analyzing multiple ordinal measurements and develop graphical diagnostic displays to evaluate the adequacy of models.