Six of the Best Algorithms Explained

In this long article, I am going to explain six of the best algorithms developed in the data mining world.

And of course, the selection I am going to share here is not mine alone; it is backed up by different expert panels.

After going through the list, I am sure you will have an idea of how these algorithms work and what needs to be done to apply them.

My only hope is that you will use this post as a springboard to understand and learn more deeply what data mining is all about.

So let us cut the introduction short and get started. Here are six of the best algorithms:

1. The AdaBoost

What does this do?

As the name suggests, AdaBoost is a boosting algorithm that constructs a classifier. If you recall, a classifier takes a bunch of data and tries to predict the class that each item in the set belongs to.

So what is boosting? It is an ensemble learning approach that takes multiple learning algorithms and combines them effectively. The goal is to take an ensemble of weak learners and combine them into a single strong one.

Now for the difference between a weak and a strong learner: a weak learner classifies with accuracy only slightly above chance. The classic example is a decision stump, which is a one-level decision tree.

A strong learner has much higher accuracy; a commonly used example is SVM.

Let us walk through a concrete example of AdaBoost. Suppose we have three weak learners and we run them through ten rounds of training on a dataset of patient data.

The dataset holds the patients' medical records. The big question is how to predict whether a given patient will develop cancer in the future.

Here is how AdaBoost answers that query.

Round 1: AdaBoost takes a sample of the training data and tests how accurate each learner is on it. The goal of this round is to find the best learner.

Round 2: AdaBoost again looks for the best learner. Here is the kicker: the training sample is now weighted by the misclassifications, so patients who were misclassified in the previous round have a higher chance of being chosen for the sample.

It is similar to a video game where you do not have to start over at level one each time you are killed. Instead you start right at level two and focus your energy on reaching level three, and so on.

In the same manner, the patients the first learner classified correctly are already handled. Instead of classifying them again, the focus shifts to the misclassified patients. The learners that were tagged as best are incorporated into the ensemble.

The misclassified patients are weighted up so they have a higher chance of being picked, and the system rinses and repeats the process.

At the end of the ten rounds we are left with a weighted ensemble of learners, each round having concentrated on the data the previous rounds failed to classify.
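To make the re-weighting idea concrete, here is a minimal sketch (my own illustration, not code from any particular library) of how one round of the classic AdaBoost update is usually written. It assumes labels coded as -1/+1, and the helper name adaboost_weight_update is made up for the example.

    import numpy as np

    def adaboost_weight_update(weights, y_true, y_pred):
        """One round of AdaBoost re-weighting (simplified sketch).

        weights : current sample weights, one per patient
        y_true  : true labels in {-1, +1}
        y_pred  : the weak learner's predictions in {-1, +1}
        """
        # Weighted error rate of this round's weak learner.
        miss = y_true != y_pred
        err = np.sum(weights[miss]) / np.sum(weights)

        # How much say this learner gets in the final ensemble.
        alpha = 0.5 * np.log((1 - err) / err)

        # Misclassified samples get heavier, correct ones lighter,
        # so the next round focuses on the hard cases.
        new_weights = weights * np.exp(-alpha * y_true * y_pred)
        return new_weights / np.sum(new_weights), alpha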

Now, is this unsupervised or supervised learning? AdaBoost is supervised learning, since each iteration trains the weak learners on a labeled dataset. It is also amazingly fast and simple, so it executes quickly, and it is elegant for auto-tuning your classifier, since each successive AdaBoost round refines the weights it gives the learners.

Our final word: AdaBoost is amazingly versatile and flexible. It can incorporate many learning algorithms and can work on a large array of datasets.
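If you want to try it out, here is a hedged usage sketch with scikit-learn's AdaBoostClassifier, using decision stumps as the weak learners and synthetic data as a stand-in for the patient records (the parameter is named estimator in recent scikit-learn releases and base_estimator in older ones):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for the patient dataset discussed above.
    X, y = make_classification(n_samples=200, n_features=6, random_state=0)

    stump = DecisionTreeClassifier(max_depth=1)   # the weak learner (a decision stump)
    model = AdaBoostClassifier(estimator=stump, n_estimators=10, random_state=0)
    model.fit(X, y)
    print(model.score(X, y))   # training accuracy of the boosted ensemble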

2. The K-Means

What does this do?

To put it briefly, k-means creates k groups from a set of objects, so that the members of a group are similar to one another. It is also a popular cluster analysis method for exploring a dataset.

Just hang on, we know what is on your mind: what is cluster analysis? Cluster analysis is simply a family of algorithms designed to form groups whose members are more similar to each other than to non-members (we believe the definition is clear enough).

Groups and clusters are treated as synonyms in the cluster analysis world.

Is there an example? Without a doubt. Let us assume we have a dataset of patients; in cluster analysis, each patient is called an observation.

Of course we know many things about each patient, like age, blood pressure, pulse, cholesterol, VO2max, etc. We call these items "attributes", and together they form the vector that represents each patient.

Let us look deeper. You can think of a vector as a list of numbers describing a patient, and that list can be interpreted as coordinates in a multi-dimensional space. Take note: a patient's pulse can be one dimension, blood pressure another unique dimension, and so on for every attribute.

You might wonder how, given these attributes, we are going to cluster together the patients that are similar in age, pulse, blood pressure, and so on.

The best part is that you tell k-means how many clusters you want, and k-means takes care of the rest. How does it do that? K-means has many variations optimized for certain types of data.

At a high level, they all do the same thing: k-means picks k points in the multi-dimensional space to represent the k clusters. These points are called "centroids".

Every patient is assigned to the centroid it is closest to; since not all patients are close to the same centroid, the patients within range of a given centroid form a cluster around it.

What we have now are k clusters, with every patient a member of the cluster it is nearest to. K-means then finds the centre of each cluster based on its members, using the patients' attributes.

That centre point becomes the new centroid of the cluster. Since the centroid is now in a different place, patients who end up closer to another centroid change their cluster membership.

Steps two and three are repeated until the centroids stop moving and the cluster memberships stabilize. When that happens, we call it "convergence".
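Here is a minimal sketch of that assign-and-update loop, assuming every patient has already been turned into a numeric vector; the function name kmeans is just for illustration, and the sketch ignores edge cases such as a cluster losing all its members:

    import numpy as np

    def kmeans(points, k, n_iter=100, seed=0):
        """Minimal k-means loop mirroring the steps described above."""
        rng = np.random.default_rng(seed)
        # Step 1: pick k points at random to act as the initial centroids.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign every point (patient) to its closest centroid.
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Step 3: move each centroid to the centre of its current members.
            new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):   # convergence
                break
            centroids = new_centroids
        return labels, centroids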

Now, is k-means unsupervised or supervised?

It depends on the variation, but k-means is usually classified as unsupervised. Beyond specifying the number of clusters, k-means learns the clusters automatically from the data, without any labels telling it how the clustering should turn out.

Why do we need to use k-means?

We believe this one is not hard to answer. The key selling point of k-means is simplicity, and that simplicity makes it faster and more efficient than many other algorithms at handling large datasets.

In addition, k-means can be used to pre-cluster a massive dataset and then run a more expensive cluster analysis on the resulting sub-clusters.

K-means can also be used to rapidly try out a value of k and explore whether there are missed patterns or relations in the dataset. But for all those strong points, we also found two weaknesses of k-means: sensitivity to outliers and sensitivity to the initial choice of centroids.

A final thought on what k-means really is: it is designed for continuous data, although with a few tricks you can make it work reasonably well with discrete data too.
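As a quick usage sketch, scikit-learn ships a KMeans implementation; the tiny "patient" vectors below are made up purely for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy "patient" vectors: [age, pulse, systolic blood pressure]
    patients = np.array([
        [25, 60, 110], [30, 65, 115], [28, 62, 112],   # one likely group
        [60, 85, 150], [65, 90, 155], [62, 88, 152],   # another likely group
    ])

    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(patients)
    print(model.labels_)           # cluster membership for each patient
    print(model.cluster_centers_)  # the final centroids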

3. The C4.5

We will start with a simple question.

What does C4.5 do?

To answer the opening question, C4.5 is a classifier, and the model it constructs takes the form of a decision tree.

Before that can happen, C4.5 needs a set of data whose records are already classified; in other words, it starts from pre-labeled examples.

Oops, before we forget, do you know what a classifier is?

If you don't, we have to cover it here so the rest makes sense: a classifier is a data mining tool that takes a bunch of data representing the things we want to classify and attempts to predict which class each item belongs to.

To understand this more deeply, let us walk through an example with C4.5. Suppose a dataset holds a group of patients, and for each patient we know things like age, blood pressure, pulse, family history, VO2max, etc. These items are what we call "attributes".

Given these attributes, let us say our aim is simply to predict whether a certain patient will get cancer. That produces two classes: class one, the patient gets cancer, and class two, the patient won't get cancer. C4.5 is told the class of each patient in the training data.

Here is the clearer deal: using the patients' attributes and their corresponding classes, C4.5 constructs a decision tree that can predict the class of a new patient from the given attributes.

Sounds cool, right? But wait a minute.

What is a decision tree about?

A decision tree classifies data with something very much like a flowchart. Using the same patient example, one node of the flowchart could ask about the patient's family history of cancer.

If that attribute is positive, the patient may have a higher genetic risk of cancer-related illness. Another node might ask whether the patient has a tumor, and whether the tumor is greater than a certain size.

The idea is that every node in the flowchart asks a question about an attribute value and, depending on the answer, the item follows one branch or another until it is classified. There are many examples of decision trees that you can look up.

Now for the unsupervised versus supervised category. C4.5 is supervised learning: the dataset is labeled with classes.

Using the patient data as an example, C4.5 does not simply figure out on its own that a patient will get cancer in the long run. We tell it first, and it then generates a decision tree that is used to classify new patients.

So now you might be wondering how C4.5 is distinguished from other decision tree systems. Here are some things to consider.

First, C4.5 uses information gain when generating the decision tree. Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting, which improves the resulting tree. Third, C4.5 can work with both continuous and discrete data.

Our understanding is that it handles continuous data by specifying ranges or thresholds, in effect turning the continuous attributes into discrete ones. And finally, incomplete data is dealt with in the algorithm's own way.

Now then, why use C4.5? It is arguably the best decision tree approach around, given the ease of interpretation and the explanations a tree offers.

Not only that, it is also super fast and popular, and of course the output it generates is readable by humans.
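C4.5 itself is not part of scikit-learn, but its close relative CART is, and a fitted tree can be printed as exactly the kind of human-readable flowchart described above. A minimal sketch, using a medical dataset bundled with scikit-learn as a stand-in:

    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier, export_text

    # A bundled medical dataset stands in for the patient records above.
    data = load_breast_cancer()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

    # The fitted tree prints as a flowchart of attribute questions.
    print(export_text(tree, feature_names=list(data.feature_names)))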

The next question: where is it used?

A popular open source Java implementation can be found at OpenTox. Orange, another open source data visualization and analysis tool, implements C4.5 in its decision tree classifier.

4. The Apriori

This one is my favorite. What does it do?

Apriori is an algorithm that learns association rules, and it is applied to databases containing large numbers of transactions. What is an association rule? Association rule learning is a data mining method for discovering relations and correlations among the variables in a database.

The best example we can give for Apriori is a database full of supermarket transactions. It is basically a giant spreadsheet in which each row is a customer's transaction and the columns are the items in the grocery store.

Here is the best part: by applying the Apriori algorithm, we can immediately learn which grocery items are purchased together, expressed as association rules.

We now have the power to find the items that are frequently purchased together; the ultimate goal is to determine which items shoppers buy in combination. These combinations are called "item sets".

For instance, in a grocery store you might see that chips and dip, or chips and soda, are frequently bought together. These are what we call two-item sets.

With a large dataset it becomes harder to spot which relationships to look at, especially when they involve three-item sets or more.

This is where Apriori comes in. If you still wonder how Apriori works, we need to clearly define three things so you can understand it easily.

  1. Item set: the first thing, since you aim to find patterns as two-item sets, three-item sets, four-item sets, etc.
  2. Support: the second, the number of transactions containing an item set divided by the total number of transactions. An item set that meets a specified minimum support is called a "frequent item set".
  3. Confidence: the last one, the conditional probability of an item given the other items in the item set. For example, chips in the basket might give a 67% confidence that soda is in the same basket. (A small worked sketch of support and confidence follows this list.)
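To pin those two numbers down, here is a tiny worked sketch over five made-up transactions (the values are illustrative only, not taken from any real dataset):

    # Five made-up supermarket transactions.
    transactions = [
        {"chips", "dip", "soda"},
        {"chips", "soda"},
        {"chips", "dip"},
        {"bread", "milk"},
        {"chips", "soda", "bread"},
    ]

    def support(itemset):
        """Fraction of transactions containing every item in the item set."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        """P(consequent | antecedent): support of both over support of the antecedent."""
        return support(antecedent | consequent) / support(antecedent)

    print(support({"chips", "soda"}))        # 3 of 5 transactions -> 0.6
    print(confidence({"chips"}, {"soda"}))   # 3 of the 4 chips baskets -> 0.75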

There is a three-step approach in the basic Apriori algorithm (a compact sketch of the loop follows this list):

  1. Join – scan the whole database and count how often each candidate item set occurs.
  2. Prune – item sets that fail to meet the minimum support are discarded; only the survivors move on to the next, larger item set size.
  3. Repeat – as the word says, the process simply repeats for each item set size until the previously defined size is reached or no frequent item sets remain.
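Here is that join / prune / repeat loop as a compact, from-scratch sketch; the apriori function name and the minimum support value are my own choices for illustration, and the sketch checks support directly rather than using every pruning shortcut of the full algorithm:

    def apriori(transactions, min_support=0.4):
        """Compact sketch of the join / prune / repeat loop described above."""
        n = len(transactions)
        items = {item for t in transactions for item in t}
        # Start with the frequent 1-item sets.
        frequent = {frozenset([i]) for i in items
                    if sum(i in t for t in transactions) / n >= min_support}
        all_frequent, k = set(frequent), 2
        while frequent:
            # Join: build candidate k-item sets from the surviving (k-1)-item sets.
            candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
            # Prune: keep only candidates that meet the minimum support.
            frequent = {c for c in candidates
                        if sum(c <= t for t in transactions) / n >= min_support}
            all_frequent |= frequent
            k += 1   # Repeat for larger item sets until none survive.
        return all_frequent

    transactions = [{"chips", "dip", "soda"}, {"chips", "soda"},
                    {"chips", "dip"}, {"bread", "milk"}, {"chips", "soda", "bread"}]
    print(apriori(transactions))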

Again, is this unsupervised or supervised?

Generally, Apriori is considered an unsupervised learning approach. Why? Because the algorithm is designed to discover, or mine, interesting relationships and patterns in the dataset.

Wait! We are not yet done here. Apriori can also be modified to do classification on labeled data. Why should I use Apriori? Because it is easy to understand and very easy to implement, and it has a number of derivatives. On the other hand, the algorithm can be memory-, time- and space-intensive when generating the item sets.

Where is Apriori used?

Plenty of tools implement Apriori. The popular ones are Orange, ARtool and Weka.

5. The Support Vector Machine

What does SVM do?

SVM (Support Vector Machine) learns a hyperplane for classifying a set of data into two classes.

At a high level, SVM performs a task similar to C4.5, except that it does not use a decision tree at all.

Whoa, it talks about a hyperplane. A hyperplane of what? A hyperplane is like the line of the equation y = mx + b: with only two features, the classification boundary really is a line, so a hyperplane can be thought of as that line generalized to more dimensions.

SVM can also perform a trick to project the data into a higher dimension; once the data is projected there, SVM figures out the best hyperplane to separate it into the two classes.

To understand how SVM works more clearly, here is an example. The best one we can give is a set of red and blue balls on a table. If the balls are not too mixed together, you could simply lay down a stick to separate them.

Then, when a new ball is added, you can predict its color simply by checking which side of the stick it falls on, and use the stick to place it on the right side.

The red balls and the blue balls represent the two classes, and the stick represents the hyperplane, which in this case is obviously a line.

Again, here is the coolest part: SVM figures out the equation of that hyperplane. Now, what if things get complicated? That scenario happens frequently. If the red and blue balls are mixed together, a straight stick no longer works.

Here is what to do instead: quickly lift the table and throw the balls into the air, and while the balls are in flight, use a large sheet to divide them.

You might think that method is cheating. It is not. Lifting the table is the equivalent of mapping the dataset into a higher dimension: in this case we go from the two-dimensional surface of the table to the three-dimensional space the balls occupy while in the air.

Now, how do the balls on the table, or in the air, map onto actual data? A ball on the table has a location we can identify with coordinates. Say a ball sits 20 cm from the left edge and 50 cm from the bottom edge. Another way to describe that ball is with x and y coordinates, (20, 50), which are its two dimensions.

Here is the deal: if we had a dataset of patients, each patient could have various measurements, e.g. cholesterol level, pulse, blood pressure, age, etc. Each of these measurements is its own dimension.

The point is that SVM maps the items into a higher dimension and then finds a hyperplane that separates them into the two classes.

Support vector machines are commonly associated with margins. The margin is the distance between the hyperplane and the closest points of the two classes. In the table-and-balls example, the margin is the distance between the stick and the nearest red and blue balls.

The key point is that SVM attempts to maximize this margin, so that the hyperplane sits as far as possible from the closest balls, which decreases the chance of misclassifying new data.

Using the table-and-balls example again, the hyperplane ends up equidistant from the nearest blue ball and the nearest red ball. These closest data points, the balls, are called the support vectors, because they "support" the hyperplane.

Now the question: is it unsupervised or supervised? This is supervised learning, since a labeled dataset is used to teach the SVM about the classes; only then can the SVM classify new data.

So why use a support vector machine instead of C4.5, since both are classifiers you would try first on new data? The answer is simple: there is no single best classifier, on the basis of the "No Free Lunch" theorem.

Where is it used?

There are many implementations of the Support Vector Machine. Some of the most popular ones are scikit-learn, MATLAB and libsvm.
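As a usage sketch with scikit-learn, here is an SVM on a toy dataset that no straight line can separate, standing in for the mixed red and blue balls; the RBF kernel choice is my own for illustration:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Two classes that no straight line can separate in 2-D,
    # like the mixed red and blue balls on the table.
    X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

    # The RBF kernel plays the role of throwing the table in the air:
    # it implicitly maps the points into a higher-dimensional space
    # where a separating hyperplane exists.
    model = SVC(kernel="rbf", C=1.0).fit(X, y)
    print(model.score(X, y))            # training accuracy
    print(len(model.support_vectors_))  # the points that "support" the plane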

6. The EM

This algorithm is considered the most difficult of the six to explain, but I will try my best to share what I can. It might not be perfect, but rest assured that the information shared here went through a thorough brainstorming session with my colleagues.

What does this do? In the world of data mining:

EM (expectation maximization) is used as a clustering algorithm, much like k-means, for knowledge discovery.

In statistics, the EM algorithm iterates to optimize the likelihood of the observed data while estimating the parameters of a statistical model that has unobserved variables.

Just hang on, we can feel you, so let us move to the next part of the discussion. I am not a statistician, and hopefully my simplifications are correct and add to your understanding. Here are some concepts to make it easier.

What is a statistical model?

Basically, a statistical model describes how the observed data is generated. For example, grades for an exam could follow a bell curve, so the assumption that grades are generated via a bell curve (that is, a normal distribution) is the model being used.

Wait, what does a distribution stand for? It represents the probabilities of the measurable outcomes in a dataset. For instance, exam grades could fit a normal distribution.

The distribution then represents the probability of each grade. Therefore, given a grade, you can use the distribution to determine how many students are expected to get that grade. Cool, right?

So what are the parameters of the model? A parameter describes a distribution that is part of the model. For instance, a bell curve is described by its mean and variance.

Using the exam scenario, the distribution of grades on the exam (the outcome being measured) follows a bell curve (the distribution). Say the mean is 85 and the variance is 100. Then those two parameters are all you need to describe the distribution.

Now for mean, variance and likelihood, sticking with the bell curve from the previous example. Assume we have a set of grades and are told that the grades follow a bell curve. However, we are only given a sample of the grades, not all of them.

The deal here is that the true mean and variance of the grades are unknown, but using the sample we can estimate them. The likelihood is then the probability that the bell curve with the estimated mean and variance would produce the set of sample grades we observed.

In other words, given the measurable outcomes, we estimate the parameters; the probability of the observed outcomes under those estimated parameters is what we call the likelihood.

Always remember that it is the probability of the existing grades, not of future grades. So you might be wondering what plain probability is, then. Using the curve as an example, suppose you already know the mean and variance.

You are also told that the grades follow the curve; the chance of each observable grade, and how often it would be observed, is the probability. In other words, given a set of parameters, we estimate which outcomes we expect to observe. That is probability, and that is what it does for us.
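To ground the vocabulary, here is a small sketch of both directions using the numbers above (mean 85, variance 100, so standard deviation 10); the use of scipy and the sample grades are my own assumptions for illustration:

    import numpy as np
    from scipy.stats import norm

    # "Probability" direction: the parameters are given (mean 85, variance 100,
    # i.e. standard deviation 10) and we ask how likely a particular grade is.
    curve = norm(loc=85, scale=10)
    print(curve.pdf(90))          # density of the bell curve at a grade of 90

    # "Likelihood" direction: the grades are given and we ask how plausible
    # an estimated mean and variance are for the sample we observed.
    grades = np.array([72, 81, 85, 88, 90, 95])
    mean_hat, std_hat = grades.mean(), grades.std()
    print(mean_hat, std_hat ** 2)                         # estimated mean and variance
    print(norm(mean_hat, std_hat).logpdf(grades).sum())   # log-likelihood of the sample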

Now, what is the main difference between unobserved and observed data?

Observed data is simply data that has been seen or recorded. Unobserved data is data that is missing, and there are many reasons data can be missing (not recorded, ignored, etc.).

Let me discuss this a bit more: in clustering and data mining, what matters is that the class of each data point is missing. We obviously do not know the missing classes, so interpreting the missing data this way is crucial when applying EM to a clustering task.

Again: the EM algorithm iterates to optimize the likelihood of the observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully that is clearer now.

Now the best part again: by optimizing the likelihood, EM generates a model that can assign class labels to the data points (which sounds a lot like clustering to me).

How does expectation maximization help with the clustering?

The answer to the question is simple: EM starts by making a guess at the model parameters. It then follows a three-step iterative process.

E-step: based on the current model parameters, it calculates the probability of assigning each data point to each cluster.

M-step: it updates the model parameters based on the cluster assignments from the E-step.

Repeat: the two steps are repeated until the model parameters and the cluster assignments stabilize, which is called "convergence".
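For a hands-on sketch, scikit-learn's GaussianMixture is fitted with exactly this kind of E-step / M-step loop; the toy two-group data below is made up for illustration:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy data with two hidden groups, standing in for the unobserved classes.
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(60, 5, size=(100, 2)),    # one hidden cluster
                      rng.normal(85, 5, size=(100, 2))])   # another hidden cluster

    model = GaussianMixture(n_components=2, random_state=0).fit(data)
    print(model.means_)             # estimated cluster centres (model parameters)
    print(model.predict(data[:5]))  # soft assignments turned into hard labels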

Now, is it unsupervised or supervised?

Since the class labels are not provided, this falls under unsupervised learning. Why use EM, then? The key point is that EM is simple and easy to implement. In addition, it can optimize the model parameters and make educated guesses about missing data in the dataset.

EM is good for clustering and for building a model described by parameters. Knowing the clusters and the model parameters is significant because clusters share common patterns, which tell you which cluster new data is likely to belong to.

EM also has its weaknesses, and we found a couple:

  1. First, although EM is fast in the early iterations, it gets slow in the later iterations.
  2. Second, EM does not always discover the most favorable parameters and can get stuck in a local optimum rather than the global optimum.