TY - JOUR
T1 - Genetic classification of populations using supervised learning
AU - The International Schizophrenia Consortium (ISC)
AU - Bridges, Michael
AU - Heron, Elizabeth A.
AU - O'Dushlaine, Colm
AU - Segurado, Ricardo
AU - Morris, Derek W.
AU - Corvin, Aiden
AU - Gill, Michael
AU - Pinto, Carlos
AU - Kenny, Elaine
AU - Quinn, Emma M.
AU - O'Donovan, Michael C.
AU - Kirov, George K.
AU - Craddock, Nick J.
AU - Holmans, Peter A.
AU - Williams, Nigel M.
AU - Georgieva, Lucy
AU - Nikolov, Ivan
AU - Norton, N.
AU - Williams, H.
AU - Toncheva, Draga
AU - Milanova, Vihra
AU - Owen, Michael J.
AU - Hultman, Christina M.
AU - Lichtenstein, Paul
AU - Thelander, Emma F.
AU - Sullivan, Patrick
AU - McQuillin, Andrew
AU - Choudhury, Khalid
AU - Datta, Susmita
AU - Pimm, Jonathan
AU - Thirumalai, Srinivasa
AU - Puri, Vinay
AU - Krasucki, Robert
AU - Lawrence, Jacob
AU - Quested, Digby
AU - Bass, Nicholas
AU - Gurling, Hugh
AU - Crombie, Caroline
AU - Fraser, Gillian
AU - Kuan, Soh Leh
AU - Walker, Nicholas
AU - St Clair, David
AU - Blackwood, Douglas H.R.
AU - Muir, Walter J.
AU - McGhee, Kevin A.
AU - Pickard, Ben
AU - Malloy, Pat
AU - Maclean, Alan W.
N1 - Funding Information:
We thank the individuals and families who contributed data to the International Schizophrenia Consortium. We are grateful to the reviewers for their constructive comments. We thank the members of the Statistical Genetics Unit in the Neuropsychiatric Genetics Group for helpful comments and advice at all stages of this work. We acknowledge Anthony Ryan for reviewing and commenting on this manuscript. The authors would like to acknowledge support from the Cambridge Centre for High Performance Computing, where this work was carried out, and to thank Stuart Rankin for computational assistance. Additionally, we acknowledge Steve Gull for useful discussions and for the use of MemSys in this application.
PY - 2011/1/1
Y1 - 2011/1/1
N2 - There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case-control studies and quality control (when participants in a study have been genotyped at different laboratories). This latter application is of particular importance in the era of large-scale genome-wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations, are termed unsupervised. Supervised methods, on the other hand, are able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results, that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large-scale genome-wide association studies.
UR - http://www.scopus.com/inward/record.url?scp=79955939428&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79955939428&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0014802
DO - 10.1371/journal.pone.0014802
M3 - Article
C2 - 21589856
AN - SCOPUS:79955939428
SN - 1932-6203
VL - 6
JO - PLoS ONE
JF - PLoS ONE
IS - 5
M1 - e14802
ER -