
# Classification problem where one attribute is a vector

Problem Detail:

Hello, I am a layman trying to analyze game data from League of Legends. Specifically, I want to predict the win rate for a given champion and item build.

### Outline

A player can own up to 6 items at the end of a game. They could have purchased these items in different orders or adjusted their inventory position during the course of the game.

As a result, the dataset may contain rows such as the following:

      champion id   |                 item ids                | win(1)/loss(0)
    ----------------------------------------------------------------------------
           45       |   [3089, 3135, 3151, 3157, 3165, 3285]  |       1
           45       |   [3151, 3285, 3135, 3089, 3157, 3165]  |       1
           45       |   [3165, 3285, 3089, 3135, 3157, 3151]  |       0

Although the items appear in a different order, the build is the same in all three rows. My initial thought was simply to multiply the item ids, as this would give me a single integer representing that combination of 6 items.

While there are hundreds of items, in practice a champion draws on a small subset (~20) of them to form the core (3 items) of their build. A game may also finish before players have had time to purchase 6 items:

                    item ids
    ------------------------------------------
       [3089, XXXX, 3151, 3285, 3165, 0000]
       [XXXX, 3285, XXXX, 3165, 3151, 0000]
       [3165, 3285, 3089, XXXX, 0000, 0000]

    XXXX  item from outside core subset
    0000  empty inventory slot

Because item 3089 complements champion 45, core builds that include item 3089 have a higher win rate than core builds that lack it.

The data set available for each champion contains between 10,000 and 100,000 rows. The mean is probably around 35,000.

### Questions

1. Is this a suitable problem for supervised classification?
2. How should I approach finding groups of core items and their win rates?

Yes. If you have a non-trivial data set of this form, this would be a reasonable fit for statistical analysis.

For independent variables, you have one binary feature for each core item (indicating whether that item was purchased or not); the outcome is a binary variable (win or loss). Accordingly, one reasonable approach would be to try logistic regression. You'll have one independent variable $X_i$ for each item: $X_i=1$ means that item $i$ is one of the items the champion purchased in this game, and $X_i=0$ means item $i$ was not purchased. You'll have a dependent variable $Y$: $Y=1$ means that the champion won this game, and $Y=0$ means the champion lost. The fitted coefficients then tell you which items tend to be associated with winning games: a positive coefficient for $X_i$ means that builds containing item $i$ win more often, with the other items held fixed.
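To make the setup concrete, here is a minimal sketch of logistic regression fitted by plain gradient descent, using only the Python standard library. The toy rows, the function names `sigmoid` and `fit_logistic`, and the learning-rate and epoch settings are all made up for illustration; with real data you would more likely reach for an existing statistics or machine-learning library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the mean log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # derivative of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up toy data (NOT real game data): each row is ([X_0, X_1, X_2], Y).
# Item 0 is present in 6 of 8 wins; items 1 and 2 are evenly split.
rows = [
    ([1, 0, 0], 1), ([1, 0, 0], 0),
    ([1, 0, 1], 1), ([1, 0, 1], 1),
    ([1, 1, 0], 1), ([1, 1, 0], 1),
    ([1, 1, 1], 1), ([1, 1, 1], 0),
    ([0, 0, 0], 1), ([0, 0, 0], 0),
    ([0, 0, 1], 0), ([0, 0, 1], 0),
    ([0, 1, 0], 0), ([0, 1, 0], 0),
    ([0, 1, 1], 1), ([0, 1, 1], 0),
]
X = [features for features, _ in rows]
y = [outcome for _, outcome in rows]

weights, bias = fit_logistic(X, y)
for i, w in enumerate(weights):
    print(f"item {i}: coefficient {w:+.3f}")
```

On this toy data the coefficient for item 0 comes out clearly positive, while the coefficients for items 1 and 2 are close to zero, which matches how the rows were constructed.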

There are other methods you could try as well: pretty much any method for supervised classification that works well with binary/categorical variables.

The one thing I don't recommend you do is multiply the item codes. That's not going to help. Instead, just have 20 features, where each feature indicates whether a particular item was purchased or wasn't.
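As a rough sketch of that encoding: the helper below turns an inventory into one 0/1 feature per core item. The name `encode_build`, the `CORE_ITEMS` list (filled with the six ids from the question's example rows rather than a real 20-item pool), and the placeholder id 9999 for a non-core item are assumptions made for illustration.

```python
import math

# Hypothetical core item pool for one champion; a real pool would have ~20 ids.
CORE_ITEMS = [3089, 3135, 3151, 3157, 3165, 3285]

def encode_build(item_ids, core_items=CORE_ITEMS):
    """Return one 0/1 feature per core item.

    Slot order, empty slots (0), and non-core items are all ignored.
    """
    owned = set(item_ids)
    return [1 if item in owned else 0 for item in core_items]

# Same build in two different slot orders: identical feature vectors.
a = encode_build([3089, 3135, 3151, 3157, 3165, 3285])
b = encode_build([3151, 3285, 3135, 3089, 3157, 3165])

# Unfinished build: 9999 stands in for a non-core item, 0 for an empty slot.
partial = [3165, 3285, 3089, 9999, 0, 0]
c = encode_build(partial)

print(a == b)              # order does not matter
print(c)                   # only the core items that are present are set
print(math.prod(partial))  # multiplying ids: any empty slot makes this 0
```

The last line shows one concrete problem with multiplying ids: if empty slots are stored as 0, every unfinished build collapses to the same value. Even without empty slots, the size of the product carries no information about how good the build is, whereas the binary features let a classifier weigh each item separately.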