For a large online recruitment company, we developed a high-performance web-scraping solution that gathers and processes high volumes of new job posts from several diverse sources.
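The extraction step of such a scraper can be sketched with Python's standard-library HTML parser. The CSS class names ("job-title", "job-location") and the field layout are hypothetical, standing in for per-source parsing rules:

```python
from html.parser import HTMLParser

class JobPostParser(HTMLParser):
    """Minimal sketch: pull title/location fields out of scraped HTML."""

    def __init__(self):
        super().__init__()
        self._field = None          # field currently being read
        self.posts = []             # extracted {"title", "location"} dicts

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls == "job-title":
            self.posts.append({})   # a new post starts with its title
            self._field = "title"
        elif cls == "job-location":
            self._field = "location"

    def handle_data(self, data):
        if self._field and self.posts:
            self.posts[-1][self._field] = data.strip()
            self._field = None

html = """
<div class="job-title">Data Engineer</div>
<div class="job-location">Skopje</div>
<div class="job-title">NLP Specialist</div>
<div class="job-location">Remote</div>
"""
parser = JobPostParser()
parser.feed(html)
print(parser.posts)
```

In production each source would need its own parsing rules and politeness controls (rate limiting, retries); the sketch only shows the structured-extraction idea.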
We built a system that collects AdSense metrics (impressions, clicks, applies, etc.) for vacant job postings. By analyzing each job post together with the metrics collected for it, we predict which new postings will be the most attractive, that is, which will generate the most impressions and applies. Based on these predictions, we decide which vendors (job boards) to distribute each posting to, how much to spend advertising each job, and a few other similar recommendations.
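The recommendation step can be illustrated with a small sketch: score each new posting's predicted attractiveness and split an advertising budget proportionally across the top-scoring postings. The weights, feature names, and scoring formula here are illustrative, not the production model:

```python
# Hypothetical blend of predicted engagement rates for a posting.
def attractiveness(post):
    return 0.6 * post["pred_ctr"] + 0.4 * post["pred_apply_rate"]

def allocate_budget(posts, total_budget, top_k=2):
    """Spend the budget on the top_k postings, proportionally to score."""
    ranked = sorted(posts, key=attractiveness, reverse=True)[:top_k]
    total = sum(attractiveness(p) for p in ranked)
    return {p["id"]: round(total_budget * attractiveness(p) / total, 2)
            for p in ranked}

posts = [
    {"id": "job-1", "pred_ctr": 0.05, "pred_apply_rate": 0.010},
    {"id": "job-2", "pred_ctr": 0.02, "pred_apply_rate": 0.004},
    {"id": "job-3", "pred_ctr": 0.08, "pred_apply_rate": 0.020},
]
print(allocate_budget(posts, total_budget=100.0))
```

The real system derives the per-posting predictions from the trained models; the sketch assumes they are already available as `pred_ctr` and `pred_apply_rate`.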
Among other things, we configured a Hadoop cluster (Cloudera distribution) and deployed it on Amazon AWS. We designed the HBase tables for the real-time, high-volume system and deployed MapReduce jobs written in Pig Latin, Java, and Python for natural language processing. We then built the prediction models and applied them in decision-support systems. The most demanding and challenging task was the parallel implementation of the feature-selection and classification algorithms on the MapReduce paradigm.
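The MapReduce decomposition of feature selection can be sketched locally in plain Python: mappers emit (term, class) pairs from labeled postings, the shuffle groups them by term, and reducers aggregate per-class counts that a scoring function (chi-square, information gain, or similar) would then consume. Hadoop and Pig specifics are omitted; the example data and labels are made up:

```python
from collections import defaultdict

def mapper(doc, label):
    # Emit one (term, label) pair per distinct term in the document.
    for term in set(doc.lower().split()):
        yield term, label

def reducer(term, labels):
    # Aggregate per-class occurrence counts for one term.
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1
    return term, dict(counts)

docs = [
    ("Senior Java developer wanted", "attractive"),
    ("Java engineer for backend team", "attractive"),
    ("Night shift warehouse operator", "unattractive"),
]

# Map phase: emit key/value pairs from every document.
emitted = [kv for doc, label in docs for kv in mapper(doc, label)]

# Shuffle phase: group values by key, as the framework would.
grouped = defaultdict(list)
for term, label in emitted:
    grouped[term].append(label)

# Reduce phase: per-term class counts, the input to feature scoring.
counts = dict(reducer(t, ls) for t, ls in grouped.items())
print(counts["java"])
```

On the cluster, the same map and reduce functions run in parallel across data partitions; only the shuffle is shown here as an in-memory dictionary.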