Language Classification + AWS ML: what am I doing wrong?

Problem Detail: 

I'm evaluating Amazon's machine learning platform, and thought that I would give it a "simple" classification problem first. As a disclaimer, I am quite new to machine learning (hence my interest in an ML platform).

The classification problem is language detection. Given a list of 20k words, and their language (English, French, or Random), train a model to classify new words.

My data is in CSV format, with two columns (the word and its language):

    dàagzj, random
    tunisia, english
    craindre, french
    voters, english
    religions, english
    condition, french
    ...

I imported the data successfully into the platform, and all seems fine.

When I attempt to train a model (using both the default settings and tweaked ones), I get the same result: English is predicted as the language nearly 100% of the time.

I know that simple neural networks can achieve reasonably accurate results on this problem, so I'm not sure what is going wrong.

Do I need to perform any preprocessing operations on the text input, or is the plain string sufficient? What data can be collected about a single word that may be a more effective input to a machine learning model?

Asked By : David Ferris
Answered By : D.W.

There are probably multiple things wrong here.

Features

First, you don't tell us what features you have provided. If the only input you have provided is the word itself (e.g., the string craindre), then most likely the machine learning algorithm has no idea how to use that information: all it knows is that this string is different from all the others you've provided, so it has no ability to generalize.

So, you need to derive some suitable features, to enable this to generalize. For instance, maybe you'll have a feature vector of length $26^2=676$, one for each possible pair of letters, counting the number of times that pair of letters appears consecutively in the word. Then, instead of asking the machine learning algorithm to predict the language from the word itself, ask it to predict the language from the feature vector.
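For concreteness, here is a minimal Python sketch of that letter-pair counting idea. The function name and the choice to simply drop characters outside a-z (accents, punctuation) are my own simplifications for illustration, not part of any particular platform's API:

    from string import ascii_lowercase

    # Map each of the 26 letters to an index 0..25
    INDEX = {c: i for i, c in enumerate(ascii_lowercase)}

    def bigram_counts(word):
        # 26*26 = 676 counters, one per ordered pair of letters
        vec = [0] * (26 * 26)
        # Keep only plain a-z letters; accents and punctuation are dropped for simplicity
        letters = [c for c in word.lower() if c in INDEX]
        for a, b in zip(letters, letters[1:]):
            vec[INDEX[a] * 26 + INDEX[b]] += 1
        return vec

    # 'craindre' contributes to the entries for cr, ra, ai, in, nd, dr, re
    print(sum(bigram_counts("craindre")))  # 7 letter pairs in total

You would then feed this 676-entry vector (rather than the raw string) to the learning algorithm as the features for each word.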

You can certainly come up with much more sophisticated and effective features; this is just an example. For instance, maybe you might have a feature that counts the fraction of letters that were vowels, or the fraction of letters that were consonants, or a feature that indicates what the last letter was, or a feature that indicates how many consecutive vowel-pairs the word had. Basically, you want to pick feature values that you think might be helpful at predicting the language or might tend to take different values for different languages. If you want to build something that will be very accurate at predicting language, you'll probably need to get fairly sophisticated in designing features. There's lots of research literature on language detection; you could try reading it to get some ideas for better features. But if you just want to play around with this, my advice is to start with a small number of simple features.
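As a rough sketch, a few of those simple hand-designed features could look like this in Python; the exact feature set and the names here are illustrative, not a recommended design:

    VOWELS = set("aeiou")

    def simple_features(word):
        letters = [c for c in word.lower() if c.isalpha()]
        n = max(len(letters), 1)
        vowel_fraction = sum(c in VOWELS for c in letters) / n
        # Count positions where two vowels appear back to back, e.g. 'ai' in 'craindre'
        vowel_pairs = sum(a in VOWELS and b in VOWELS
                          for a, b in zip(letters, letters[1:]))
        return {
            "vowel_fraction": vowel_fraction,
            "consonant_fraction": 1.0 - vowel_fraction,
            "last_letter": letters[-1] if letters else "",
            "consecutive_vowel_pairs": vowel_pairs,
        }

    print(simple_features("craindre"))
    # {'vowel_fraction': 0.375, 'consonant_fraction': 0.625,
    #  'last_letter': 'e', 'consecutive_vowel_pairs': 1}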

Typically, machine learning algorithms are very smart about finding patterns in the feature vectors you provide -- but they don't do anything to automatically select features. So, the division of labor is: you come up with the features; the ML algorithm looks for patterns that enable it to use those features to predict the answer.

Class imbalance

I suggest you read about "the class imbalance problem". You have 3 classes (English, French, and random). Your training set is imbalanced: you have roughly twice as many English examples as French or random ones. In other words, you don't have an equal number of examples for each class.

This is not necessarily a problem, but in some cases it can be. For instance, some machine learning algorithms behave poorly when the training set is imbalanced. And every machine learning algorithm will work sub-optimally when the frequency of the classes in the training set doesn't match what you expect to see in practice.

In your case, it's probably not any of those complicated explanations; it's probably a much simpler reason. Because you haven't given the ML algorithm any useful features, it basically has no information whatsoever that it can use to generalize or spot patterns. In other words, it's trying to guess the language of a given word given no information about that word. So what is the poor ML algorithm to do? All it can do is guess based on the relative frequency of different languages. In your case, your training set has told it that any given word is more likely to be English than to be French or random (about 2x as likely), so if you're forced to make a guess and pick one language, the smartest one to pick is 'English'.
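You can see this concretely with a majority-class baseline. This is a hedged sketch: the file name and the two-columns-per-row layout are assumptions based on the data shown in the question, not something I know about your setup:

    import csv
    from collections import Counter

    # Count how many examples each language label has
    # (assumes every row is exactly "word, language" with no header)
    counts = Counter()
    with open("words.csv", newline="", encoding="utf-8") as f:
        for word, language in csv.reader(f):
            counts[language.strip()] += 1

    majority_class, majority_count = counts.most_common(1)[0]
    total = sum(counts.values())
    print(counts)
    print(f"Always guessing '{majority_class}' is right "
          f"{majority_count / total:.0%} of the time.")

With no informative features, a classifier can't do better than this constant guess, which is why it keeps answering 'English'.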

In conclusion: it is no surprise that the ML algorithm is always outputting 'English', in your situation.

Once you add a suitable feature vector, if it still works poorly, you can read about the class imbalance problem and see whether it applies to you.

Question Source : http://cs.stackexchange.com/questions/66475
