Functional data analysis is a statistical area analysing data in the form of functions, images, shapes or even more general random objects. Thanks to the advance of modern technology, more and more such data are being routinely collected; and thanks to the rapid progress of computational machine learning, these data are being analysed and are influencing our lives in many respects. For instance, when unlocking a smartphone using cameras that capture your facial characteristics or sensors that read your fingerprints, the smartphone is collecting images, detecting the signals and comparing them to pre-stored information. As another example, one way to understand the subtypes of attention deficit hyperactivity disorder (ADHD) is to study the shapes of the corpus callosum, which often serve as guidance in diagnosis. Despite the vitality and importance of analysing functional data, handy statistical methods often lack theoretical guarantees when applied to functional data. Without theoretical guarantees, interpretations of the analysis results can be misleading, even dangerous.
The state-of-the-art theoretical developments in functional data analysis, especially for functional clustering methods, suffer from the following issues.
1. The majority of the existing literature relies on functional principal component analysis, which maps the infinite-dimensional covariance operator to a low-dimensional space, so that the analysis in the infinite-dimensional functional space is transformed into a manageable finite-dimensional one. However, the success of this transformation relies on the assumption that the number of non-zero eigenvalues of the covariance operator is bounded. This is a strong condition, since it excludes many standard functional spaces, e.g. Sobolev spaces. A minimal sketch of the discretised procedure is given after this list.
2. The majority of the existing theoretical results are asymptotic, in the sense that they describe the limiting performance of some statistical procedures, without detailing how fast these procedures reach a desirable rate, or how large the sample size needs to be in order to reach a given accuracy level. The lack of fixed-sample results also hinders the analysis of high-dimensional data.
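To fix ideas, below is a minimal sketch of functional principal component analysis in its standard discretised form, assuming curves observed on a common, equally spaced grid; the function name, the truncation level n_components and the Riemann-sum quadrature are illustrative assumptions rather than part of the proposal.

```python
import numpy as np

def functional_pca(X, t, n_components=3):
    """Truncated FPCA for curves observed on a common grid.

    X : (n_curves, n_grid) array of discretised functions.
    t : (n_grid,) equally spaced evaluation points.
    Returns scores, eigenfunctions and eigenvalues of the (discretised)
    sample covariance operator.
    """
    dt = t[1] - t[0]                      # grid spacing, used as quadrature weight
    X_c = X - X.mean(axis=0)              # subtract the sample mean function
    # Discretised covariance operator: the dt factor makes the matrix
    # eigendecomposition approximate that of the integral operator.
    C = (X_c.T @ X_c) / X.shape[0] * dt
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    # Rescale eigenvectors so the eigenfunctions have unit L2 norm.
    phi = eigvecs[:, idx] / np.sqrt(dt)
    scores = X_c @ phi * dt               # projections <X_i - mean, phi_k>
    return scores, phi, eigvals[idx]
```

The boundedness condition in item 1 says that all but finitely many entries of eigvals vanish; this already fails for, e.g., Brownian motion, whose covariance operator has infinitely many non-zero eigenvalues.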
In this research proposal, I will start with a specific problem: providing theoretical guarantees for the convergence of the functional Lloyd's algorithm, which is the default algorithm for k-means clustering. With these in hand, I will then provide fixed-sample error controls for functional k-means clustering methods. The success of these two steps will shed light on providing theoretical guarantees for clustering more general objects in manifold learning, which will be the starting point of a further programme.
This agenda may seem standard, because the k-means clustering method is standard and handy in many application areas. But in functional spaces, there is no theoretical guarantee of convergence and no theoretical understanding of when the algorithm should converge, let alone of how good the final clustering estimators are, how many iterations are needed, how many samples are needed, or which function sampling schemes are best. This work will provide answers to these questions.
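For concreteness, below is a minimal sketch of Lloyd's algorithm applied to discretised curves, assuming observations on a common, equally spaced grid, with squared L2 distances approximated by Riemann sums; the random initialisation, the iteration cap and all names are illustrative assumptions, not the proposed methodology.

```python
import numpy as np

def functional_lloyd(X, t, k, n_iter=50, seed=0):
    """Lloyd's algorithm for k-means on discretised curves.

    X : (n_curves, n_grid) array of discretised functions.
    t : (n_grid,) equally spaced evaluation points.
    Returns cluster labels and the final centre functions.
    """
    rng = np.random.default_rng(seed)
    dt = t[1] - t[0]
    # Initialise the k centres with randomly chosen curves.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centre in (discretised) L2 distance.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2) * dt
        labels = d2.argmin(axis=1)
        # Update step: each centre is the mean function of its cluster;
        # an empty cluster keeps its previous centre.
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```

The questions above concern precisely this iteration: whether and how fast the centres converge, how the numbers of iterations and samples enter the error, and how the grid, i.e. the function sampling scheme, affects the answers.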