{"url": "http://mondego.ics.uci.edu/projects/yelp/", "content": "\n<!-- saved from url=(0032)http://www.ics.uci.edu/~vpsaini/ -->\n<html><head><meta charset='utf-8'>\n<title>The Yelp dataset challenge - Multilabel Classification of Yelp reviews into relevant categories</title>\n<meta name=\"description\" content=\"Yelp dataset challenge\">\n<meta name=\"keywords\" content=\"Yelp, challenge, data mining\">\n<style>\nA { COLOR: #62021; TEXT-DECORATION: none ; }\na:visited{ color: blue }\nA:hover { TEXT-DECORATION: underline }\n.lowspace { min-height: 5}\n.midspace { min-height: 10}\n.highspace { min-height: 15}\n.media {width:459;height:358;}\n.left{float:left;}\n.right{float:right;}\n.photo{height: 309;\n\twidth: 430;\n\tpadding-left: 57;\n\tpadding-bottom: 25;}\n\ttable {\n\t\tpadding-left: 20;\n\t\tpadding-right: 20;\n\t}\n\ttd {\n\t\ttext-align: justify;\n\t\tfont-size: 110%;\n\t\tline-height: 150%;\n\t\tmargin-bottom: 48px;\n\t\tmargin-top: -22px;\n\t\tfont-family: Georgia;\n\t}\n\tdivimg {\n\t\ttext-align: center;\n\t\tfont-size: 110%;\n\t\tline-height: 150%;\n\t\tmargin-bottom: 48px;\n\t\tmargin-top: -22px;\n\t\tfont-family: Georgia;\n\t}\n\t.mondego{width:300px;\n\t\theight:110px;\n\t\tbackground: transparent url('http://mondego.ics.uci.edu/img/mondego-banner.png') -0px -0px no-repeat;}\n\t\t</style>\n\n\t\t<style type=\"text/css\"></style><style type=\"text/css\"></style></head>\n\t\t<body link=\"blue\" alink=\"blue\" vlink=\"violet\">\n\t\t\t<script>\n\t\t\t (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){\n\t\t\t (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n\t\t\t m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n\t\t\t })(window,document,'script','//www.google-analytics.com/analytics.js','ga');\n\n\t\t\t ga('create', 'UA-41064162-1', 'uci.edu');\n\t\t\t ga('send', 'pageview');\n\n\t\t\t</script>\n\t\t\t\t\t\t\n\n\t\t\t<table>\n\t\t\t\t<tbody><tr><td 
width=\"20\"><img src=\"./files/logo.png\" width=\"100\" height=\"100\">\n\t\t\t\t\t<!--<tr><td width=20><img src=\"images/shipra-snow.JPG\" width=600 height=400></img>-->\n\t\t\t\t</td><td width=\"20\">\n\t\t\t</td><td> <h2><a href=\"http://www.yelp.com/dataset_challenge/\">Yelp Dataset Challenge</a></h2>\n\n\t\t\t<i>\n\t\t\t\tThe Team<br>\n\t\t\t\t<a href =\"http://www.ics.uci.edu/~hsajnani/\">Hitesh Sajnani</a>, <a href=\"http://www.linkedin.com/in/sainivaibhav\">Vaibhav Saini</a>, <a href =\"http://www.linkedin.com/in/kusumkumar\">Kusum Kumar</a>\n\t\t\t\t, <a href =\"http://www.linkedin.com/in/egabrielova\">Eugenia Gabrielova </a>, <a href=\"http://www.linkedin.com/in/pramitc\">Pramit Choudary</a>, <a href=\"http://www.ics.uci.edu/~lopes/\">Cristina Lopes</a> <br>\n\t\t\t</i>\n\t\t</div></div></td>\n\t\t<td>\n\t\t\t<div class=\"mondego\"></div>\n\t\t</td></tr></tbody></table>\n\t\t<table>\n\t\t\t<tbody>\n\t\t\t\t<tr><td><br><h3>Classifying Yelp reviews into relevant categories</h3>\n\t\t\t\t\tYelp users give ratings and write reviews about businesses and services on Yelp. These reviews and ratings help other Yelp users evaluate a business or a service and make a choice. While ratings are useful to convey the overall experience, they do not convey the context which led a reviewer to that experience. \n\t\t\t\t\tFor example, consider a Yelp review of a restaurant which has 4 stars: <br>\n\t\t\t\t\t<blockquote><i><strong>\"They have the best happy hours, the food is good, and service is even better. When it is winter we become regulars\".</strong></i></blockquote> <br> \n\n\t\t\t\t\tIf we look at only the rating, it is difficult to guess why the user rated the restaurant 4 stars. However, after reading the review, it is not difficult to identify that the review talks about good \"food\", \"service\" and \"deals/discounts\" (happy hours). 
<br> <br>\n\t\t\t\t\t\n\t\t\t\t\tA quick inspection of a few hundred reviews helped us identify the categories that appear most frequently in the reviews. We settled on 5 categories: \u201cFood\u201d, \u201cService\u201d, \u201cAmbience\u201d, \u201cDeals/Discounts\u201d, and \u201cWorthiness\u201d. The \"Food\" and \"Service\" categories are easy to interpret. The \"Ambience\" category relates to the d\u00e9cor and the look and feel of the place. The \"Deals/Discounts\" category corresponds to offers during happy hours, or specials run by the venue. The \u201cWorthiness\u201d category can be summarized as value for money: users often express whether the overall experience was worth the money. It is important to note that the \"Worthiness\" category is different from the \u201cPrice\u201d attribute already provided by Yelp. \"Price\" captures whether the venue is \u201cinexpensive\u201d, \u201cexpensive\u201d or \u201cvery expensive\u201d. <br> <br>\n\t\t\t\t\t\n\t\t\t\t\tThis high-level categorization of reviews into relevant categories can help users understand why a reviewer rated the restaurant \u201chigh\u201d or \u201clow\u201d. This information can help other Yelpers make a personalized choice, especially when they do not have much time to spend reading reviews. Moreover, such categorization can also be used to rank restaurants according to these categories.<br><br>\n\t\t\t\t\t\n\t\t\t\t\tWe formulated the task of classifying a review into relevant categories as a learning problem. However, since a review can be associated with several categories at the same time (the categories are not mutually exclusive), it is neither a simple binary classification problem nor a multi-class classification problem. It is instead a multi-label classification problem. <br>\n\n\t\t\t\t\tHere is a short video describing how (and why?) 
Yelp can build some cool features using this categorization: <br> <br> \n\t\t\t\t\t<iframe class=\"media\" src=\"http://www.youtube.com/embed/jPQCiSmxwrg\" frameborder=\"0\" allowfullscreen></iframe>\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t<!--For example, in case of a restaurant, the rating might be influenced by the food, or the ambience, or the service, or the discounts offered by the restaurant, maybe all or some combination of these. This information is not conceivable from only rating, however, it is present in the reviews which users write. We conjecture that such contextual information can be automatically extracted from the reviews. This contextual information when presented to the user by classifying reviews into various relevant categories can prove to be very effective in making an informed decision. Having such a high level categorization can allow users to make a quick personalized choice when one does not have much time to spend on reading the reviews. Moreover, such information can also be used to rank venues based on the categories. --><br>\n\t\t\t\t\t\n\t\t<!--This will help users to choose a venue according to the criteria they value more. For example, a user may give more weightage to food, whereas other may value ambience and service more. Currently, yelp users have no idea if reviewers priority of ambience, food, service, price was same as theirs. We choose {Food, Service, Ambience, Deals/Discount, and Price} as our categories for classification. .<br>\n\t-->\n</tr>\n<!--\n<tr>\n\t<td><br>\n\t\t<h3>1. Introduction </h3>\n\t\tYelp users give ratings and write reviews about businesses and services on Yelp. These reviews and rating help other Yelp users to evaluate a business or a service and make a choice. While ratings are useful to convey the overall experience, they do not convey the context which led users to that experience. 
For example, in case of a restaurant, the rating might be influenced by the food, or the ambience, or the service, or the discounts offered by the restaurant, maybe all or some combination of these. This information is not conceivable from only rating, however, it is present in the reviews which users write. We conjecture that such contextual information can be automatically extracted from the reviews. This contextual information when presented to the user by classifying reviews into various relevant categories can prove to be very effective in making an informed decision. Including such high level categorization can allow users to make a quick personalized choice when one does not have much time to spend on reading the reviews. Moreover, such information can also be used to rank venues based on the categories.\n\n\t\tAlthough, the functionality described above is desirable and useful for any kind of business, we limit the scope of our problem for only restaurants. <br>To understand the problem in this context, consider a Yelp review:\n\n\t\t\n\t\t<br>\n\t</td>\n</tr>\n-->\n<tr>\n\t<td><br>\n\t\t<h3>Corpus </h3>\n\t\tThe <a href=\"http://www.yelp.com/dataset_challenge\">Yelp dataset</a> released for the academic challenge contains information for 11,537 businesses. This dataset has 8,282 check-in sets, 43,873 users, and 229,907 reviews for these businesses. For our study, since we are only interested in restaurant data, we considered only those businesses that are categorized as food or restaurants. This reduced the number of businesses to around 5,000. <br> <br>\n\t\tWe selected all the reviews for these restaurants that had at least one useful vote. From this pool of useful reviews, we randomly chose 10,000 reviews. A labeling codebook describing what categories to include was developed through an initial open coding of a random sample of 400 reviews. The codebook was validated and refined based on a second random sample of 200 reviews. 
This exercise helped us fix 5 categories: <i>food, ambience, service, deals, and worthiness</i>. Once we identified these 5 categories, the 10,000 reviews were divided into 5 bins with repetition in each bin. Six graduate student researchers from our group then read each of these reviews and annotated it with the identified categories. <b>It took us approximately 225 man-hours to annotate all the reviews.</b> We identified the conflicts in the annotation of reviews among different annotators and removed from the analysis all the reviews where there were discrepancies among the annotators. This left us with 9,019 reviews. We split these annotated reviews into 80% train and 20% test data. <br><br>\n\n\t\tThe review annotation process was very challenging and time consuming. We believe that it is one of the major contributions of this work. We plan to release the annotated data for researchers to extend the work. \n\t\t<!--The annotated train and the test dataset can be downloaded from here: <a href=\"./files/Train.csv\"> Train data</a> and <a href=\"./files/Test.csv\"> Test data</a> -->\n\t\t<br>\n\n\t</td>\n</tr>\n<tr>\n<td><br>\n\t<h3>Feature Extraction and Normalization</h3>\n\tWe extracted two types of features: (i) star ratings and (ii) textual features consisting of unigrams, bigrams and trigrams. <br>\n\tFor star ratings, we created three binary features representing ratings of 1-2 stars, 3 stars, and 4-5 stars, respectively. <br>\n\tTo extract textual features, we first normalized the review text by converting it to lowercase and removing special characters. We did not remove stop words, as they play an important role in conveying user sentiment.\n\tThe cleaned text is then tokenized to collect unigrams (individual words) and calculate their frequencies across the entire corpus. \n\tThis results in 54,121 unique unigrams. 
We condense this feature set by only considering unigrams with a frequency greater than 300, which results in <a href=\"./files/unigrams.txt\">375 unigram features</a>. \n\tSimilarly, we extract <a href=\"./files/bigrams.txt\">208 bigrams</a> and <a href=\"./files/trigrams.txt\">120 trigrams</a>.\n\t<br><br>\n\tThe ARFF files containing the features extracted for the train and test data can be downloaded here: <a href=\"./files/train.arff\"> Train.arff</a> and <a href=\"./files/test.arff\"> Test.arff</a>\n\t<br> <br>\n\tHere is a short video describing the corpus and the feature engineering: <br> <br> \n\t<iframe class= \"media\" src=\"http://www.youtube.com/embed/0M-Mz-Uzogs\" frameborder=\"0\" allowfullscreen></iframe>\n\t<br>\n</td>\n</tr>\n<tr>\n\t<td><br>\n\t\t<h3>Classification</h3>\n\t\t<div>\n\t\tIn this section, we describe the various approaches we took to build a classifier and explain our choices based on the advantages and disadvantages of each approach. \n\t\tYou can also watch the video presentation to get a quick overview:<br> <br> \n\t\t<iframe class=\"media\" src=\"http://www.youtube.com/embed/L4hzuaHD8s4\" frameborder=\"0\" allowfullscreen></iframe>\n\t\t<br>\n\n\t\tThe problem of classifying a review into multiple categories is not a simple binary classification problem. Since a reviewer can talk about various things in his or her review, each review can be classified into multiple categories. <br> <br>\n\t\t\n\t\tOne of the most popular, and perhaps the simplest, ways to deal with multiple categories is to create a binary classifier for each category. In our case, we create 5 binary classifiers, one each for the food, service, ambience, deals, and worthiness categories. To do this, we need to transform the dataset into 5 different datasets, each of which has information about only one category. 
<br> <br>\n\t\t\n\t\tTo understand this, consider a scaled-down version of our dataset which has only 4 categories (food, ambience, service, and deals).\n\t\tAs shown in Figure 1, we create four different datasets from this original dataset such that each dataset is associated with only one category. For example, the new dataset created for the \"food\" category will have only one label. The label will be '1' for all the datapoints which had '1' for \"food\" in the original dataset. Similarly, the dataset created for \"service\" will have the label '1' for all the datapoints that had '1' for \"service\" in the original dataset. This is done for all the categories present in the dataset. \n\t\t\n\t\tGiven a new review, the binary classifier for each category predicts whether the review belongs to that category. The final prediction is the union of the categories predicted by the individual binary classifiers. <br> <br>\n\t \n\t\t<div style=\"width:500;height:375;text-align:center;\"> <img width=\"500\" height=\"358\" src=\"./files/binaryrelevance.png\" align=\"middle\">Figure 1. </div>\n\t\t \n\t\t<br>\n\t\n\t\tOnce the dataset is transformed into 4 different datasets, any binary classifier (nearest neighbour, SVM, decision trees, etc.) can be used with this approach. This approach is simple because it treats each category independently. As a consequence, however, it ignores correlations among categories. The independence assumption may not hold, especially if the categories share some aspects with each other. For example, in our case, when people get deals, they mostly feel the restaurant is worth visiting. This means that the \"deals\" and \"worthiness\" categories are correlated. Similarly, we saw correlation between the \"service\" and \"ambience\" categories. Such correlation may be high in some cases and very low in others; in our case, we thought it was worth accounting for. <br> <br>\n\t\t\n\t\tIn order to account for correlation among categories, we considered each distinct subset of L as a single category. 
Here L is the set of all the categories (a.k.a. targets), so L = {Food, Service, Ambience, Deals}. For example, as shown in Figure 2, we transform the target for Review 1, {Food, Deals}, into a single target with value \"1001\". Similarly, the target for Review 2, {Ambience, Deals}, is transformed into \"0011\". The pattern is formed by creating a bit vector which has a fixed index for each category. We then learn a multi-class classifier h : X &rarr; P(L), where X is the space of reviews and P(L) is the powerset of L, containing all possible category subsets. This approach takes into account correlations between categories, but suffers from the large number of category subsets. E.g., if we have 5 categories, this approach will generate 2<sup>5</sup> = 32 possible targets to predict, most of which will have only a few datapoints to learn from. This approach might work well if there is a large training dataset which covers all (or at least most of) the possible targets to predict.\t\t\n\t\t<br> <br>\n\t\t<div style=\"width:500;height:375;text-align:center;\"> \n\t\t<img width=\"500\" height=\"358\" src=\"./files/classifier_each_subset.gif\" align=\"middle\">Figure 2. \n\t\t</div>\n\t\t<br>\n\t\tWe wanted to get the best of both worlds, i.e., consider correlation among categories while avoiding the large number of subsets generated by the previous approach. Hence, we decided to use an ensemble of classifiers where each classifier is trained using a different <b>small subset</b> (of size k) of categories. \n\t\tFor example, suppose there are 4 categories {Food, Service, Ambience, Deals} and we choose subset size k = 2. We then build a total of <sup>4</sup>C<sub>2</sub> = 6 classifiers, one for each of the following combinations of categories: {(Food, Service), (Food, Ambience), (Food, Deals), (Service, Ambience), (Service, Deals), (Ambience, Deals)}. See the first part of Figure 3. \n\t\tFor prediction, as shown in the second part of Figure 3, we collect the predictions of all six classifiers and then take a majority vote. 
<br> \n\n\t\tThis approach considers correlations among categories and, because each classifier considers only a small subset of categories, does not generate a very large number of targets. <br> <br>\n\t\t\n\t\t<div style=\"width:950;height:375;text-align:center;\"> \n\t\t<img class=\"media\" src=\"./files/ensemble_subset_classifiers.gif\" align=\"middle\"> <img class=\"media\" src=\"./files/ensemble_classifiers_prediction.gif\" align=\"middle\">\n\t\tFigure 3. \n\t\t</div>\n\t\t\n\t</td>\n</tr>\n<tr>\n\t<td>\n\t\t<br>\n\t\t<h3>Experiments</h3>\n\t\t\n\t\t<h4>Evaluation metrics</h4>\n\t\tWe use Precision and Recall to measure the performance of a classifier. <br>\n\t\tTo understand what precision and recall mean in our context,\n\t\tconsider (x, Y) to be a datapoint where x is the review text and Y is the set of true categories. Y ⊆ L, where L = {Food, Service, Ambience, Deals, Worthiness}. <br>\n\t\tLet h be a classifier. <br>\n\t\tLet Z = h(x) be the set of categories predicted by h for the datapoint (x, Y). Then, <br>\n\t\tPrecision = |Y ∩ Z|/|Z| (out of the categories predicted, how many of them are true categories) <br>\n\t\tRecall = |Y ∩ Z|/|Y| (out of the total true categories, how many of them were predicted)<br>\n\n\t\t<h4>Results </h4>\n\t\t\n\t\tWe experimented with all three approaches discussed above, using Precision and Recall as our evaluation metrics. The comprehensive set of experiment configurations (different approaches, different classifiers, different feature sets, parameter settings) can be found in this <a href=\"https://docs.google.com/spreadsheet/ccc?key=0Ahz5yC6y5YLVdEg5blgyRU5wR2lyYVBJREJLMzlDYlE&usp=sharing\">result sheet</a>. <br> <br> \n\t\t\n\t\tIn the first approach of using |L| binary classifiers, where |L| is the total number of categories, we used Naive Bayes, k-Nearest Neighbour, Support Vector Machines (SMO implementation), decision trees, and Neural Networks. 
In the figure below, we report results for only Naive Bayes and k-NN for this approach, as only they were competitive. <br>\n\t\t\n\t\tIn the second approach, where we consider label correlations and predict the powerset of labels, decision trees performed the best. We also experimented with the ensemble of classifiers approach using decision trees, which gave us the best results overall. <br> <br>\n\t\t\n\t\t<div class=\"left\"><img width=\"400\" height=\"300\" class=\"photo\" src=\"./files/resultstrain.png\">\n\t\t</div>\n\t\t<div class=\"left\">\n\t\t\t<img width=\"400\" height=\"300\" class=\"photo\" src=\"./files/resulttest.png\">\n\t\t</div>\n\t</td>\n</tr>\n\n\n<tr>\n\t<td><br>\n\t\t<h3> Conclusion </h3>\n\t\tYelp reviews and ratings are an important source of information for making informed decisions about a venue. We conjecture that further classification of Yelp reviews into relevant categories can help users make an informed decision based on their personal preferences for categories. Moreover, this is especially useful when users do not have time to read many reviews to infer the popularity of venues across these categories. In this paper, we demonstrated how reviews for restaurants can be automatically classified into five relevant categories with precision and recall of 0.72 and 0.71 respectively. We found that an ensemble of two multi-label classification techniques (Binary Relevance and Label Powerset) performed better than either technique individually. Moreover, there is no significant difference in performance when using a combination of unigrams, bigrams and trigrams instead of only unigrams. We 
also showed how the results of this study can be incorporated into Yelp\u2019s existing website.\n\n\t</td>\n</tr>\n<tr>\n\t<td><br>\n\t\t<h3> Technical Report </h3>\n\t\t<a href=\"./files/technical_report.pdf\">Multilabel Classification of reviews in Yelp data</a>\n\n\t</td>\n</tr>\n<tr>\n\t<td><br>\n\t\t<h3> Download Presentation </h3>\n\t\t<a href=\"./files/Yelp_Review_Classification.pptx\">Multilabel Classification of reviews in Yelp data</a>\n\n\t</td>\n</tr>\n\n<tr>\n\t<td><br>\n\t\t<!--\n\t\t<h3> References </h3>\n\t\t[1] Zhang, M.-L., and Zhou, Z.-H. 2007. Ml-knn: A lazy learning approach to multi-label learning, pattern recognition. Pattern Recognition 40(7):2038\u20132048.<br>\n\t\t[2] Tsoumakas, G.; Vilcek, J.; Spyromitros, E.; and Vlahavas, I. 2010. Mulan: A java library for multi-label learning. Journal of Machine Learning Research.<br>\n\t\t[3]Tsoumakas, G., Katakis, I., Vlahavas, I. (2010) \"Mining Multi-label Data\", Data Mining and Knowledge Discovery Handbook, O. Maimon, L. 
Rokach (Ed.), Springer, 2nd edition, 2010.<br>\n\t\t-->\n\t</td>\n</tr>\n</tbody>\n\n\n</table>\n <div id=\"disqus_thread\"></div>\n <script type=\"text/javascript\">\n /* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE * * */\n var disqus_shortname = 'yelpdatasetchallenge'; // required: replace example with your forum shortname\n\n /* * * DON'T EDIT BELOW THIS LINE * * */\n (function() {\n var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;\n dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';\n (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);\n })();\n </script>\n <noscript>Please enable JavaScript to view the <a href=\"http://disqus.com/?ref_noscript\">comments powered by Disqus.</a></noscript>\n <a href=\"http://disqus.com\" class=\"dsq-brlink\">comments powered by <span class=\"logo-disqus\">Disqus</span></a>\n\n</body></html>\n", "encoding": "utf-8"}