
 

Starting today, we kick off the Uhuru Data Lab (UDL) blog with this first post. After some exciting data crunching, we present our approach to the recently concluded Coupon Purchase Prediction competition hosted by the leading coupon site Recruit ポンパレ (Ponpare).

The competition asked teams to use one year of past purchase and browsing data from roughly 22,000 Ponpare users to predict which coupons each user will purchase over the following week. We present a simple gradient boosted decision tree model which gives a score of 0.0072 on the Leaderboard.



Overview, cleaning the data and evaluation metric:

With any real dataset, cleaning is the first step: understand the files through entity relationship diagrams, compute basic statistics and study the data's basic characteristics.

 

Understanding the data:

In this case we had coupon purchase and view logs of 22,873 users from 01-07-2011 to 23-06-2012. In total they viewed 32,628 coupons on the website in this period, but only 19,413 of these coupons have attributes provided, such as price rate, discount price, usable dates and shop location. We consider these the training set coupons. Predictions are to be made over 310 new coupons released in the next week (24-06-2012 to 30-06-2012), whose attributes have also been provided. The challenge is to predict which of these 310 coupons will be purchased by the 22,873 users. The past browsing data contains attributes such as view date, user id, purchase flag, session id and viewed coupon id. There is also data on user age, gender, prefecture and registration/withdrawal date.

 


Figure 1. A summary of the data files and the attributes they contain.

It appears possible to purchase a coupon without any recorded view. We assumed that each such purchase has one view at exactly the date and time the purchase occurred.
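For illustration, a minimal sketch of this imputation in R, assuming the competition's file and column names (coupon_visit_train.csv with USER_ID_hash, VIEW_COUPON_ID_hash, PURCHASE_FLG, I_DATE; coupon_detail_train.csv with USER_ID_hash, COUPON_ID_hash, I_DATE); these names are an assumption and may need adjusting:

```r
library(data.table)

# View log and purchase log (column names assumed from the competition data)
visits    <- fread("coupon_visit_train.csv")
purchases <- fread("coupon_detail_train.csv")

# User-coupon pairs that already have at least one recorded view
viewed_pairs <- unique(visits[, .(USER_ID_hash, COUPON_ID_hash = VIEW_COUPON_ID_hash)])

# Purchases with no matching view (anti-join)
orphans <- purchases[!viewed_pairs, on = c("USER_ID_hash", "COUPON_ID_hash")]

# One synthetic view per such purchase, time-stamped at the purchase date-time
synthetic_views <- orphans[, .(
  USER_ID_hash,
  VIEW_COUPON_ID_hash = COUPON_ID_hash,
  I_DATE,                 # same date and time as the purchase
  PURCHASE_FLG = 1L
)]

visits <- rbind(visits, synthetic_views, fill = TRUE)
```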

 

The map@10 metric:

The evaluation metric employed was Mean Average Precision @ 10 (map@10). A good explanation of this metric can be found here. With respect to this metric, only each user's top 10 coupon predictions matter, and the order of predictions matters. We observed from the data that most coupon purchasers in a week view ≤ 10 coupons in that week.
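For reference, a small R sketch of Average Precision @ 10 and its mean over users (the function names here are ours, not from any package):

```r
# Average Precision at k for one user: 'actual' are the truly purchased coupon ids,
# 'predicted' is the ordered vector of recommended coupon ids.
apk <- function(actual, predicted, k = 10) {
  predicted <- head(predicted, k)
  hits <- predicted %in% actual
  if (length(actual) == 0 || sum(hits) == 0) return(0)
  precision_at_i <- cumsum(hits) / seq_along(hits)
  sum(precision_at_i * hits) / min(length(actual), k)
}

# Mean Average Precision at k over all users (lists are indexed by user)
mapk <- function(actual_list, predicted_list, k = 10) {
  mean(mapply(apk, actual_list, predicted_list, MoreArgs = list(k = k)))
}

# Example: a user purchased coupons "A" and "C"
apk(c("A", "C"), c("B", "A", "D", "C"))  # (1/2 + 2/4) / 2 = 0.5
```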

 


Figure 2. Week Number 52 refers to the most recent week. Week Number 1 refers to the first week of the visit data.

Naturally, the numbers of viewers and views are much larger than the numbers of purchasers and purchases. To obtain more balanced training data and reduce sparsity (as we explain later), we changed the problem statement and predicted views instead of purchases. Interestingly, the Leaderboard score obtained this way was sufficient to get into the top 40.

 


The ideas and the XGBoost model:

The problem is essentially cold-start item recommendation with both user and item attributes present. User attributes can be extracted from each user's past weblog; coupon attributes are given.

 

XGBoost GBDT model:

The XGBoost gradient boosted decision tree model was used.

  1. Training set: For the previous week, make pairs of viewers (of coupons in that week) and coupons (released in that week).
  2. The target variable is 1 if a user-coupon pair corresponds to a view else 0.
  3. To balance the training set, randomly select 13 coupons for each viewer and include all coupons with target 1. This gives an approx. 30:70 ratio of target variable.
  4. Train using the AUC metric and make predictions on all the user-coupon pairs of the next week. For each user, take the 10 coupons with the highest predicted probability as the recommended coupons.
  5. (optional) Make several training datasets in step 3, train, and average the predictions from step 4 (we made 17 such models; a minimal training sketch follows this list).
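A minimal sketch of steps 2 and 4 using the xgboost R package, assuming the feature matrices train_x / test_x and the pair table test_pairs have already been built for the user-coupon pairs of weeks (w-1) and w; object names and parameter values are illustrative starting points, not the exact ones used:

```r
library(xgboost)

# train_y: 1 if the (user, coupon) pair of week (w-1) corresponds to a view, else 0
dtrain <- xgb.DMatrix(data = as.matrix(train_x), label = train_y)
dtest  <- xgb.DMatrix(data = as.matrix(test_x))

params <- list(
  objective        = "binary:logistic",
  eval_metric      = "auc",       # step 4: train using the AUC metric
  eta              = 0.05,
  max_depth        = 8,           # illustrative; tune as described later in the comments
  subsample        = 0.9,
  colsample_bytree = 0.5
)

model <- xgb.train(params, dtrain, nrounds = 500, verbose = 0)

# Predict on all user-coupon pairs of week w and keep each user's top 10 coupons
test_pairs$prob <- predict(model, dtest)
top10 <- do.call(rbind, lapply(split(test_pairs, test_pairs$USER_ID_hash),
                               function(d) head(d[order(-d$prob), ], 10)))
```

For step 5, build several such balanced training sets and average the predicted probabilities before taking the top 10; running the same procedure with w = 52, 51, … gives local validation.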

This model alone gives a map@10 of ~0.0072 on the Private Leaderboard. Cross-validation was performed by training on the previous week's viewer-coupon pairs and predicting on the next week's user-coupon pairs.

 

Other ideas (BPRMF + k-NN model):

Another approach we tried was matrix factorization: specifically, BPRMF with k-NN mapping as explained in this paper (Learning Attribute-to-Feature Mappings for Cold-Start Recommendations). The idea is to build a user-coupon matrix over all users and all coupons (whose attributes are given) released before the current week. The matrix contains 1 if a purchase occurred, else 0. Its sparsity is around 99.96%.

We used the open-source recommender system library MyMediaLite. It already has BPRMF implemented for item prediction.

  1. Factorizing with 160 factors seemed to perform well.
  2. Use k-NN mapping (k=8 seems to perform best) to map new coupons from their attribute-space to 160-factor-space.
  3. We use only binary 0-1 attributes for the coupons and cosine similarity as the similarity function in the k-NN mapping (a rough sketch follows this list).
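A rough sketch of this mapping under stated assumptions: item_factors is the 160-factor matrix learned by BPRMF for the training coupons, and attr_train / attr_new are the binary attribute matrices of the training and new coupons (all object names are illustrative):

```r
# Cosine similarity between rows of two binary attribute matrices
cosine_sim <- function(A, B) {
  An <- A / sqrt(rowSums(A^2))
  Bn <- B / sqrt(rowSums(B^2))
  An %*% t(Bn)
}

# Map each new (cold-start) coupon into the latent space by averaging the factor
# vectors of its k nearest training coupons (an unweighted k-NN mapping; a
# similarity-weighted average is another option).
map_new_coupons <- function(attr_new, attr_train, item_factors, k = 8) {
  sim <- cosine_sim(attr_new, attr_train)           # (new x train) similarities
  t(apply(sim, 1, function(s) {
    nn <- order(s, decreasing = TRUE)[seq_len(k)]   # k most similar training coupons
    colMeans(item_factors[nn, , drop = FALSE])
  }))
}

# new_factors <- map_new_coupons(attr_new, attr_train, item_factors, k = 8)
# A user's score for a new coupon is then user_factors %*% t(new_factors);
# take the 10 highest-scoring coupons per user.
```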

This model gave slightly lower performance than the XGBoost model above (~0.0068 on the Private Leaderboard). Ensembling the two did not seem to help significantly.

 


Feature construction:

The main focus was on the features used in the XGBoost model. We categorized the features into three groups: coupon-features, user-features and user-coupon-pair-features. The user-features were further split into global-user-features (independent of the user's weblog) and week-dependent-user-features (derived from the user's weblog prior to the given week).

 

Coupon-features:


Figure 3. Coupon-features and their values.

 

User-features:


Figure 4. User-features can be categorized into global and week-dependent.

 

User-coupon-pair-features:

These features measure the affinity between a user and a coupon with respect to various attributes such as capsule text, genre name, shop large area name, catalog price and usable dates. Affinity is calculated from the user's past weblog: for each user, a score between 0 and 1 is computed with respect to each of the one-hot encoded coupon attributes above. The affinity measures also make use of the page_serial, session_id and purchase_flag fields in the data. These are the most important features in the model (a small sketch follows Figure 5).


Figure 5. User-coupon-pair features. All values are between 0 and 1 and several affinity measures were used.
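As an illustration, a minimal sketch of the simplest such affinity (the share of a user's past views falling in each genre), in R with data.table; the column names USER_ID_hash, VIEW_COUPON_ID_hash and GENRE_NAME are assumed from the competition data, and the weighted variants (session_id, page_serial, purchase_flag) follow the same pattern:

```r
library(data.table)

# 'visits' is the user weblog up to the start of the given week, already joined
# with the coupon table so that each row carries the viewed coupon's GENRE_NAME.
genre_affinity <- function(visits) {
  aff <- visits[, .N, by = .(USER_ID_hash, GENRE_NAME)]
  aff[, affinity_genre := N / sum(N), by = USER_ID_hash]  # share in [0, 1]
  aff[, N := NULL]
  aff
}

# A candidate user-coupon pair then receives the user's affinity for that
# coupon's genre as a feature (0 if the user never viewed that genre before):
# pairs <- merge(pairs, genre_affinity(visits),
#                by = c("USER_ID_hash", "GENRE_NAME"), all.x = TRUE)
# pairs[is.na(affinity_genre), affinity_genre := 0]
```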

 


Code, parameters and untried approaches:

Everything was carried out on an Amazon EC2 r3.8xlarge instance using RStudio Server. Generating the affinity features in particular takes time, as does model training. This repository contains the complete reproducible code in R.

Other approaches could be experimented with: for example, training the XGBoost model on purchases instead of views (or trying other classification models such as random forests or kNN), or using other matrix factorization methods such as BPRLinear or WRMF from the MyMediaLite package.

 

Comments:

  1. >> To balance the training set, randomly select 13 coupons for each viewer and include all coupons with target 1. This gives an approx. 30:70 ratio of target variable.

    Do you mean that you randomly selected any 13 coupons, irrespective of whether the user viewed the coupon or not?

    • Yes, you are right. So for each user, randomly select 13 coupons irrespective of whether the target is 0 or 1, then add any remaining coupons with target 1.

  2. Hi Aakansh,

    Thanks for the analysis! Very inspiring.

    Several questions:

    1. In the user features, what do is_withdrawn_beforeweekstart and is_registered_afterweekstart mean? Does the week refer to the week in which the coupon was released?
    2. In the user-coupon pair features, how do you measure affinity for the text-based attributes, i.e. Genre, Capsule Text and Area Name? Did you consider the correlation between areas that are neighbours and those that are not?
    3. By modeling users and coupons per week, why do you only consider the week the coupon is released? From the weblog, there are many purchases that happened a week or more after the coupon's DispFrom:

    PurchaseDelay PurchasedCount
    7 1975
    8 533
    9 276
    10 203
    11 191
    12 159
    13 177
    14 113
    15 2
    16 14
    17 23
    18 27
    19 20
    22 1

    Thanks,
    Sunny

    • Hi Sunny,

      Answering in reverse order.

      1. An example. If we need to predict purchases for week w: take week (w-1), find all the coupons with DispFrom in that week, and take all the users who viewed those coupons in week (w-1). Make the training data from these, and predict on the user-coupon pairs corresponding to week w. (When training, we act as if no information is known about week w, so omitting purchases that happened after week (w-1) is important.)

      2. I did not use any information related to geographical coordinates (though that could be very useful). An example again: suppose the text attribute ‘Genre’ can take 3 different values, A, B and C. For each user, find the list of coupons he/she viewed and the corresponding values of ‘Genre’. Say we obtain the list {A,B,B,A,C,A,C,A,A}. Count the occurrences of each of A, B, C and normalize by the total list size, so the user’s affinity for A becomes 4/9 and for B and C it becomes 2/9. Now, if a coupon has ‘Genre’ B, then the user’s affinity for that coupon is 2/9. Other variations of affinity can be calculated by weighting with session_id length or page_serial in the above calculation. [See the GitHub code for more details.]

      3. is_withdrawn_beforeweekstart means the user withdrew their Ponpare registration before that week; is_registered_afterweekstart means the user registered on Ponpare during that week. (The second represents a small data leak: a very small number of user registration dates given in the data lie in the test week 24-06-2012 to 30-06-2012. These users will register in the future, yet we already know about them. However, the absence of any other significant information about them makes this leak useless.)

      Thanks.

      • Hi Aakansh,

        Thanks for the detailed answers!

        So basically you used only 2 adjacent weeks, 1 for training and 1 for testing. This brings more questions:
        1. Did you train (a) 1 model per week, or (b) 1 model for the whole period regardless of which week the data belongs to?
        2. When predicting for the test week, which of the models mentioned above did you use?
        3. What made you derive training and test data from 2 adjacent weeks? Is it to eliminate time-series effects?
        4. You mentioned “we changed the problem statement and predicted just views instead of purchases”. How could this work given that the majority of views don’t result in any purchase?

        Cheers,
        Sunny

        • Trained 1 model on coupons released in week (w-1) and predicted for coupons released in the next week w. That’s it. For submission, w = 53; for local validation you can use w = 52, 51, 50, …

          Only two adjacent weeks because user behaviour changes over time, and I hypothesised it is similar in two adjacent weeks. The further we move away from week w, the more user behaviour is likely to differ.

          Predicting just views works because the evaluation metric is map@10. For users who do not purchase even after viewing, we can submit any 10 coupons and it won’t change map@10. For users who view and purchase, we hypothesise that user-coupon pairs corresponding to purchases get higher predicted probabilities than pairs corresponding to non-purchase views. (This works because of the affinity features, some of which were also weighted by the purchase_flag.)

          • Gotcha.
            The use of the properties of map@10 is really brilliant!

            So you validated the model on 50+ weeks, 1 model for 1 week? That’s tons of effort :(

  3. Hi Aakansh,
    1) Affinity for a coupon feature = % of the user’s views in each class of that coupon feature, correct? Are these the most important features in your model?
    2) How did you treat repeat purchases (duplicated user-coupon pairs) in the same week the coupon is launched? Did you remove the duplicates or retain them in the training dataset? Why?
    3) How did you tune the hyperparameters of the xgboost model? I wrote a script for randomized grid-search CV. Is this the best option? Typically, how do you approach it?

    Thanks
    Deepak

  4. 4) Can you also explain how you calculated the weights for affinity using session_id length / page_serial? The session_id length / page_serial info is not available during the test week, right?

    • Hi Deepak,

      1. Yes, you are right. And yes, these are the most important features in the model.

      2. Removed the duplicates, though it would be interesting to see the effect on the results if they were included.

      3. Training takes a lot of time, so I did not try a full grid search. Typically accuracy increases with decreasing eta, so I fix an eta (0.05), subsample (0.9) and colsample (0.5) and try different depths. After finding the best depth, I tune colsample in the direction in which accuracy improves, then perturb the depth again and see if accuracy still improves. Finally, I play with subsample (0.8, 0.9, 1.0).

      4. Affinity for an attribute is a user’s property, so it is derived from his/her past log, which includes all the coupons seen until the start of that week. Those log records do have session_id and page_serial. To calculate a user’s affinity w.r.t., say, ‘Genre’, weighted by session_id, we split that user’s data by session_id. For each split, we calculate a vector of (% views of each genre) and take the mean of these vectors.

        If you look at page_serial, it increases monotonically with time within a fixed session_id. If in a given split the page_serials are {1,3,4,9}, the user is likely to have less affinity for the coupon viewed at page_serial 3 than for the coupon viewed at page_serial 9. Why? Because it took 3-1=2 pages to arrive in the former case and 9-4=5 pages in the latter. He may have done a prefecture search, then a small_area search, then a genre search, and only then arrived at the coupon in the latter case (‘frantically searching’ for that coupon!). So the consecutive differences matter, i.e. {NA (take 1 instead), 2, 1, 5}; call them ‘page_depths’. Thus, for each split, we increase the multiplicity of genres by a factor of the corresponding page_depth and then calculate (% views of each genre). This was in fact one of the top 3 features (a sketch follows at the end of this reply).

      Hope it makes sense.

        Thanks.
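A minimal sketch of the page_depth-weighted genre affinity described in the reply above, for a single user's log; the column names SESSION_ID_hash, PAGE_SERIAL and GENRE_NAME are assumed from the competition data, and the details are illustrative:

```r
library(data.table)

page_depth_affinity <- function(user_log) {
  setorder(user_log, SESSION_ID_hash, PAGE_SERIAL)

  # page_depth = consecutive differences of PAGE_SERIAL within a session,
  # taking 1 for the first view of each session ("NA -> take 1 instead")
  user_log[, page_depth := {
    d <- c(NA_integer_, diff(PAGE_SERIAL))
    d[is.na(d)] <- 1L
    d
  }, by = SESSION_ID_hash]

  # Per session, the page_depth-weighted share of each genre ...
  per_session <- user_log[, .(w = sum(page_depth)), by = .(SESSION_ID_hash, GENRE_NAME)]
  per_session[, share := w / sum(w), by = SESSION_ID_hash]

  # ... then average those shares over the user's sessions
  n_sessions <- uniqueN(user_log$SESSION_ID_hash)
  per_session[, .(affinity = sum(share) / n_sessions), by = GENRE_NAME]
}
```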
