Test collections have been crucial to the development of information retrieval systems. Constructing a test collection requires human annotators to assess the relevance of a massive number of query-document pairs (tasks). Relevance annotations acquired through crowdsourcing platforms alleviate the enormous cost of this process, but they are often noisy. Existing models that infer true relevance labels from noisy annotations mostly assume that annotations are generated independently, and design a probabilistic graphical model on that basis to reconstruct the annotation generation process. In this paper, we relax the independence assumption by placing a Gaussian process on the true relevance labels of tasks to model their correlation. We propose a new crowd annotation generation model named CrowdGP, in which the true relevance labels are modelled through a Gaussian process, while the difficulty and bias of tasks and the competence and bias of annotators are modelled through multiple Gaussian variables. CrowdGP outperforms state-of-the-art baselines at inferring true relevance labels on two crowdsourcing relevance datasets. The experiments also demonstrate its effectiveness at predicting relevance labels for new tasks that have no crowd annotations, which is a new functionality of CrowdGP. Ablation studies demonstrate that this effectiveness is attributable to the modelling of task correlation based on the auxiliary information of tasks and the prior relevance information of documents to queries.
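
To make the generative idea concrete, the following is a minimal simulation sketch and not the CrowdGP model itself. It assumes a squared-exponential kernel over hypothetical task feature vectors, an additive combination of task and annotator biases, and annotation noise that grows with task difficulty and shrinks with annotator competence. The function names, the kernel choice, the functional form, and all variances are illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def rbf_kernel(X, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel over task feature vectors (rows of X).

    The kernel choice is an assumption made for illustration only.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dists / length_scale ** 2)


def simulate_annotations(X, n_annotators, seed=0):
    """Draw correlated latent relevance labels and noisy crowd annotations.

    Hypothetical generative process (all forms and variances assumed):
      true labels          ~ GP(0, K) over tasks
      task/annotator bias  ~ Gaussian
      noise std            = task difficulty / annotator competence
    """
    rng = np.random.default_rng(seed)
    n_tasks = X.shape[0]

    # Correlated latent relevance: one joint Gaussian draw over all tasks.
    K = rbf_kernel(X) + 1e-6 * np.eye(n_tasks)  # jitter for numerical stability
    true_labels = rng.multivariate_normal(np.zeros(n_tasks), K)

    # Independent Gaussian variables for tasks and annotators.
    task_bias = rng.normal(0.0, 0.1, size=n_tasks)
    task_difficulty = np.abs(rng.normal(0.0, 0.5, size=n_tasks))
    annotator_bias = rng.normal(0.0, 0.1, size=n_annotators)
    annotator_competence = np.abs(rng.normal(1.0, 0.3, size=n_annotators)) + 1e-3

    # Each (task, annotator) pair gets its own mean and noise level.
    mean = true_labels[:, None] + task_bias[:, None] + annotator_bias[None, :]
    noise_std = task_difficulty[:, None] / annotator_competence[None, :]
    annotations = rng.normal(mean, noise_std)
    return true_labels, annotations


if __name__ == "__main__":
    # Five tasks described by a single hypothetical feature, three annotators.
    X = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
    true_labels, annotations = simulate_annotations(X, n_annotators=3)
    print(annotations.shape)
```

Because the latent labels are drawn jointly, tasks with similar features receive similar true relevance values, which is the correlation that an independence assumption would discard. The same covariance is what would, in principle, allow a label to be predicted for a task with no annotations.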