Distributed document classification using Apache Spark

A scalable document classifier built with the Apache Spark APIs and evaluated on the Reuters Corpus. For this challenge we designed improved classifiers and pre-processing modules that scale across distributed setups on the Google Compute Cluster platform. We achieved over 80% classification accuracy.

Details of the project can be found here and here (wiki).