Pai-Hsi Huang is an alumnus of SEQAM. He started working with Professor Vladimir Pavlovic in 2002. He obtained both his Master's and Ph.D. in Computer Science under the supervision of Prof. Pavlovic, in 2004 and 2008 respectively, and earned a second Master's degree, in Statistics, in 2005. His research interests are in machine learning and data mining, with applications in bioinformatics. His focus was on interpretable statistical learning models, which tend to offer more insight into the structure of the data and perhaps into the underlying generative process. He is currently working at Novartis as a software engineer.
- He worked on protein homology detection, with the goal of developing a principled way of using a generative model (in this case, a hidden Markov model) as a feature extractor and feeding the extracted features to an interpretable discriminative model (for example, logistic regression) to perform the classification task. More specifically, he was particularly interested in developing interpretable models that offer insight into the underlying processes that generated the biosequences.
- He was also interested in semi-supervised learning algorithms, which tap into abundant unlabeled data in the hope of augmenting the training set and thus lowering the variance of the estimates.
- He used duration-explicit hidden Markov models to show that a set of critical positions, together with the distances (in number of residues) between each neighboring pair of positions, is sufficient to model a group of functionally related proteins. The required number of such critical positions is approximately a quarter of the average length of the proteins in the group.
- He also worked on a protein secondary structure prediction problem as a course project. The problem was very challenging because, first, the only information available was the primary sequence of the protein and, second, fixed-length features had to be extracted from variable-length protein sequences. The project tackled this problem using support vector machines (SVMs) and clustering methods.