This paper develops the GalaXC algorithm for Extreme Classification (XC), where the task is to annotate a document with the most relevant subset of labels from an extremely large label set. XC has been successfully applied to many real world web-scale applications like web search, product recommendation, query rewriting etc. Leading XC algorithms generally consider documents and labels as two disjoint sets, even though the two sets can be coming from the same space. Further, many XC algorithms, which can scale to millions of labels, don’t utilize label meta-data. To this end, GalaXC learns document representations by building a joint graph over documents and labels. GalaXC further employs an efficient label attention mechanism to learn label classifiers. This allowed the proposed algorithm to be up to 23% more accurate than leading deep extreme classifiers, while being upto 30-120x faster to train and 10x faster to predict on benchmark datasets. A joint graph over documents and labels allows GalaXC to naturally incorporate auxiliary sources of information. GalaXC is particularly well suited for warm start scenarios where predictions need to be made on training points with partially revealed labels. GalaXC was found to be up to 13% more accurate than XC algorithms specifically developed for the warm start setting. An efficient implementation of GalaXC allowed it be trained on a dataset with 50M labels and 97M training documents in less than 100 hours on 4xV100 GPUs. In A/B tests conducted on Bing search engine, GalaXC could improve the Click Yield (CY) and coverage by 1.52% and 1.11% respectively. Code for GalaXC will be made publicly available.