Latent Dirichlet Allocation in STRUCTURE and Topic Models    Oct 15, 2015

by Stephen Rong

Back in August, Sohini led a lab journal club on Latent Dirichlet Allocation (LDA), following a recent preprint on bioRxiv by Shiraishi et al. (2015), which applied LDA to the problem of inferring mutation signatures in cancer. The authors make some interesting comments in their discussion about how their use of LDA parallels previous uses of LDA in population structure analysis and topic modeling:

LDA is a mixed-membership (or admixture) model; such models are used to characterize latent clusters within a dataset. In mixed-membership modeling, observations don't have to belong to a single cluster, but can instead be mixtures of different clusters in varying proportions. LDA is a probabilistic mixed-membership model that assumes independence between different clusters (Blei et al. 2003).
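The generative process behind LDA can be sketched in a few lines. The sketch below is a toy simulation, not code from any of the cited papers; the dimensions (`K` clusters, `V`-word vocabulary) and the symmetric Dirichlet hyperparameters `alpha` and `beta` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions.
K, V, n_docs, doc_len = 3, 20, 5, 50
alpha, beta = 0.5, 0.1  # symmetric Dirichlet hyperparameters

# Each cluster ("topic") is a distribution over the vocabulary.
topics = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)

documents = []
for _ in range(n_docs):
    # Mixed membership: each document has its own cluster proportions.
    theta = rng.dirichlet(np.full(K, alpha))          # shape (K,)
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)      # pick a cluster for this word
        w = rng.choice(V, p=topics[z])  # draw the word from that cluster
        words.append(w)
    documents.append(words)
```

Each document's `theta` is exactly the "mixture in varying proportions" described above: no document is forced into a single cluster.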

Two classic applications of LDA are:

(1) Topic models in natural language processing, where the task is to discover topics in a set of documents, and categorize each document based on the topics to which it belongs. Blei et al. (2003) originally formulated LDA for topic modeling, and LDA has since become widely used in natural language processing and related disciplines (Blei 2012).

(2) Population structure analysis in population genetics, where the task is to discover populations in genetic data, and to infer admixture proportions for individuals. STRUCTURE (Pritchard et al. 2000) was the first such model-based mixed-membership tool for analyzing population structure, and it actually predates Blei et al.'s (2003) formulation of LDA for topic modeling.

In both applications, the structure in the data being modeled is almost identical, though the vocabulary differs. In topic models, documents are viewed as mixtures of topics that contribute different words to the document. In population structure analysis, individuals are viewed as mixtures of populations that contribute alleles to the individual, one per haplotype at each locus. Thus, words = alleles, documents = individuals, and topics = populations.
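The correspondence can be made concrete by rewriting the same generative process in population-genetic vocabulary. This is a toy simulation of admixed diploid genotypes under made-up settings (`K` ancestral populations, `L` biallelic loci, `N` individuals), not the actual STRUCTURE model specification:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: K ancestral populations, L biallelic loci, N diploid individuals.
K, L, N = 2, 100, 4

# Population-specific allele frequencies play the role of topic-word
# distributions (here, the frequency of the "1" allele at each locus).
freqs = rng.beta(0.5, 0.5, size=(K, L))          # shape (K, L)

genotypes = np.zeros((N, L), dtype=int)
for i in range(N):
    # Admixture proportions = per-document topic proportions.
    q = rng.dirichlet(np.ones(K))
    for _copy in range(2):                       # two haplotypes per individual
        # Each allele copy comes from a population chosen according to q...
        z = rng.choice(K, p=q, size=L)
        # ...and is then sampled from that population's allele frequency.
        genotypes[i] += rng.random(L) < freqs[z, np.arange(L)]
```

Reading the two sketches side by side, `q` corresponds to a document's topic proportions, `freqs` to the topic-word matrix, and each sampled allele to a word.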

STRUCTURE can therefore be viewed as a complex precursor to the more general framework of LDA, one that deals with modeling biological entities like loci and genotypes. Conversely, variational inference methods developed for computationally efficient inference on LDA models (Blei et al. 2003) were later used to build a computationally efficient version of STRUCTURE, namely fastSTRUCTURE (Raj et al. 2014).
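As a small illustration of variational inference for LDA in practice, scikit-learn's `LatentDirichletAllocation` fits the model by variational Bayes. The count matrix below is random toy data (rows could equally be documents over a vocabulary or individuals over allele counts); this is just a usage sketch, not a reproduction of any analysis from the cited papers, and it assumes scikit-learn is installed:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(2)

# A toy count matrix: rows are documents (or individuals),
# columns are vocabulary words (or alleles).
X = rng.poisson(1.0, size=(10, 30))

# scikit-learn's LDA is fit via variational Bayes, the family of
# approximate-inference methods mentioned above.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(X)   # per-row mixed-membership proportions
```

Each row of `doc_topic` is a mixed-membership vector over the three fitted clusters, directly analogous to STRUCTURE's admixture proportions.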


[1] Shiraishi, Y., Tremmel, G., Miyano, S., & Stephens, M. (2015). A simple model-based approach to inferring and visualizing cancer mutation signatures. bioRxiv.

[2] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4-5), 993–1022.

[3] Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.

[4] Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959.

[5] Raj, A., Stephens, M., & Pritchard, J. K. (2014). fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Datasets. Genetics, 197(June), 573–589.