**Problem Detail:**

Let's say I have a database of freight orders. The job is to match freight carriers with customers who need their freight moved. I have the customer's information, the freight carrier's information, and all details related to the freight orders including date ordered, date shipped, the amount of money it took to hire the freight carrier, and whether a carrier was even found to ship the order.

If I have thousands of these past freight orders, **could I use machine learning to look at future freight orders to predict whether or not a freight carrier will be found to move it?**

Bonus: If it is possible, what steps would I need to take to find the best data points to focus on? From what I understand, I need to convert everything to a number in order to train the classifier, but I am having trouble figuring out what data features are going to help make these types of predictions.

I have been studying machine learning, and I am not looking for somebody to tell me everything there is to know on the subject. I just don't know how to determine which data points are going to be useful, and I would like to know whether this is something machine learning can do (or something a beginner in machine learning can do). Sorry if the question seems vague; it's hard to articulate a question on a subject you are just starting to learn about. If anybody has materials they can link that would help me better understand these things, I would appreciate that as well.

###### Asked By : Duck Puncher

#### Answered By : jmite

Here's the thing to know: machine learning assumes that your data comes from some underlying statistical distribution. You "learn" that distribution from past examples in order to estimate the probability of some event.

The common saying in machine learning is "garbage in, garbage out." You have a bunch of random variables, but if those variables are all statistically independent of the outcome you want to predict, then you won't get anything useful from machine learning.

Say, for example, that there was a strong correlation between the amount of money paid for the carrier and whether the order is shipped. Machine learning would likely be able to discover this relationship.

Or suppose there were certain periods of the year when an order was more or less likely to be shipped. Machine learning could find that out too.

But, if there's no underlying pattern in the data you give to your training algorithm, then you will get no useful information out of it.
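To make the difference concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the field names `high_pay`, `coin_flip`, and `shipped`, and the probabilities 0.9 and 0.3. It compares the shipped rate between two groups of orders, once split by a feature that really does influence shipping and once split by a feature that is independent of it.

```python
import random


def shipped_rate_gap(orders, feature):
    """Shipped rate where `feature` is True minus shipped rate where it is False."""
    on = [o["shipped"] for o in orders if o[feature]]
    off = [o["shipped"] for o in orders if not o[feature]]
    return sum(on) / len(on) - sum(off) / len(off)


def make_synthetic_orders(n=10_000, seed=0):
    """Generate made-up orders; the probabilities below are illustrative only."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n):
        high_pay = rng.random() < 0.5    # assumed: influences whether a carrier is found
        coin_flip = rng.random() < 0.5   # assumed: unrelated to the outcome
        p_shipped = 0.9 if high_pay else 0.3
        orders.append({
            "high_pay": high_pay,
            "coin_flip": coin_flip,
            "shipped": rng.random() < p_shipped,
        })
    return orders


orders = make_synthetic_orders()
gap_high_pay = shipped_rate_gap(orders, "high_pay")
gap_coin_flip = shipped_rate_gap(orders, "coin_flip")
print(f"gap when split by high_pay:  {gap_high_pay:+.3f}")
print(f"gap when split by coin_flip: {gap_coin_flip:+.3f}")
```

The gap for `high_pay` should come out large, while the gap for `coin_flip` should hover near zero. A learning algorithm has something to pick up in the first case and nothing in the second.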

For your question of finding the best data points: don't. That's what the machine learning algorithm is for. You give it your data, and it works out which variables are statistically relevant to the outcome, given some threshold. The whole point of machine learning is letting the algorithm do that for you, and more importantly, that's all the algorithm can really do for you.

Generally, these algorithms work better when they have more data, so don't go out of your way to remove data before you give it to them.
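The data does still have to be numeric before an algorithm can weigh it, as the question notes. Below is a minimal sketch of one common way to do that without throwing any fields away: dates become month and weekday numbers, and a category becomes one 0/1 column per value (one-hot encoding). The field names `date_ordered`, `offered_rate`, and `origin_region`, and the region list, are hypothetical and not taken from the original database.

```python
from datetime import date


def encode_order(order, known_regions):
    """Turn one order record into a flat list of numbers."""
    ordered = order["date_ordered"]
    features = [
        float(order["offered_rate"]),   # already numeric, keep as-is
        float(ordered.month),           # 1-12, lets a model pick up seasonality
        float(ordered.weekday()),       # 0 = Monday ... 6 = Sunday
    ]
    # One-hot encode the category: one 0/1 column per known region.
    features += [1.0 if order["origin_region"] == r else 0.0 for r in known_regions]
    return features


# Hypothetical category values and example record.
REGIONS = ["midwest", "northeast", "south", "west"]
example = {
    "date_ordered": date(2015, 9, 21),
    "offered_rate": 1250,
    "origin_region": "south",
}
vector = encode_order(example, REGIONS)
print(vector)
```

Each order becomes a fixed-length list of numbers, which is the shape most classifiers expect for a single training example.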

And remember that all these algorithms are doing is statistics. If the model says an order will ship with probability 0.6, don't be surprised when it doesn't. It is also possible that the model says an order will ship with probability 0.99 and it still doesn't, because some variable that is hugely important in real life was never recorded in your data set. If that variable isn't correlated with the data you do have, then your model will be no good.
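As a rough illustration of what a probability of 0.6 means in practice, the sketch below simulates many orders that all received that prediction. It assumes the model is perfectly calibrated, meaning its stated probabilities match real frequencies. That is an idealization, and the function name and order count are made up for the example.

```python
import random


def simulate_outcomes(predicted_probability, n_orders=10_000, seed=1):
    """Simulate shipped (True) / not shipped (False) for orders that all
    received the same predicted probability, assuming perfect calibration."""
    rng = random.Random(seed)
    return [rng.random() < predicted_probability for _ in range(n_orders)]


outcomes = simulate_outcomes(0.6)
unshipped_share = outcomes.count(False) / len(outcomes)
print(f"share of '0.6' orders that did not ship: {unshipped_share:.3f}")
```

Even with an ideal model, roughly four in ten of these orders do not ship, so a single miss on a 0.6 prediction says very little about whether the model is wrong.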

###### Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/47239
