In my last thread, I talked about the Perceptron Trick and its flaws. Today, I learned about the next evolution: Probability-based Calculation (the engine behind Logistic Regression).
In the Perceptron Trick, we only look for misclassified data points and move our decision boundary (the line) towards them.
But Logistic Regression takes a different approach. Instead of just fixing mistakes, we look at the points that are correctly classified and try to push the line as far away from them as possible. This ensures we don't just find a line that separates the data, but the optimal line.
But how do we move the line? We update the weights!
W_new = W_old + (learning_rate) * (y - y_predicted) * X_i
Here is the problem: in a standard perceptron, if a point is predicted correctly (meaning y - y_predicted = 0), the whole update term vanishes and the weights never change for that point. We need a way to keep tweaking the weights even when a point is correctly classified.
To fix this, we replace the Step Function with the Sigmoid Function.
Why Sigmoid? A step function just gives us a hard class (0 or 1). A sigmoid function gives us a probability (between 0 and 1). Even if a data point is correctly classified as positive, its probability might only be 0.7. There is still a 0.3 probability of it being negative. This tiny margin of error means our equation never completely zeroes out, constantly nudging the weights to push the line away and increase confidence.
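Here is a tiny sketch of that difference (the score 0.85 below is just a made-up raw value of w·x + b for one point):

```python
import math

def step(z):
    # Hard threshold: only tells us the class (0 or 1).
    return 1 if z >= 0 else 0

def sigmoid(z):
    # Smooth squashing: returns a probability between 0 and 1.
    return 1 / (1 + math.exp(-z))

z = 0.85  # hypothetical raw score w . x + b for one data point
print(step(z))                # -> 1   (correct class, zero information about confidence)
print(round(sigmoid(z), 2))   # -> 0.7 (correct class, but only ~70% confident)
```

That leftover 0.3 of doubt is exactly what keeps the weight updates alive.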
The sigmoid function alone isn't enough to find the perfect line for unseen data. We need a Machine Learning approach: define a loss function and find the weights that give us the minimum loss.
First, we use the Maximum Likelihood Estimation (MLE) concept. It helps us identify which model is performing best by multiplying the predicted probabilities of all correct classes:
y1_pred * y2_pred * y3_pred ... * yn_pred
The Issue: When you multiply a large number of probabilities (which are decimals between 0 and 1), you get an incredibly tiny number (like 0.000000003). Computers hate this (it causes numerical underflow).
To solve this tiny-number problem, we introduce Cross-Entropy. It essentially takes the negative log of this product. Since the log turns multiplication into addition, we sum a bunch of reasonably sized numbers instead of multiplying a bunch of tiny ones, which is far friendlier to compute.
Also, since MLE looks for the maximum probability, adding a negative sign means we are now looking for the minimum Cross-Entropy (Loss).
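A quick sanity check that the two views are the same quantity in disguise (the four probabilities are made up):

```python
import math

# Hypothetical predicted probabilities of the correct class for 4 points
probs = [0.9, 0.8, 0.7, 0.95]

# MLE view: maximise the product of probabilities
likelihood = math.prod(probs)

# Cross-entropy view: minimise the sum of negative logs
neg_log_likelihood = -sum(math.log(p) for p in probs)

# -log(product) == sum of -logs, so maximising one = minimising the other
assert math.isclose(neg_log_likelihood, -math.log(likelihood))
print(round(neg_log_likelihood, 3))  # -> 0.736
```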
But simply doing -log(y1_pred) - log(y2_pred) isn't enough, because it only scores the positive (y = 1) points. We need to account for both classes, and that leads us to the actual Log Loss formula:
Loss = - [y * log(y_pred) + (1 - y) * log(1 - y_pred)]
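The trick in the formula is that only one of the two terms is ever "active" for a given point. A one-liner makes it concrete:

```python
import math

def log_loss(y, y_pred):
    # y = 1 -> the first term survives:  -log(y_pred)
    # y = 0 -> the second term survives: -log(1 - y_pred)
    return -(y * math.log(y_pred) + (1 - y) * math.log(1 - y_pred))

print(round(log_loss(1, 0.9), 3))  # -> 0.105  confident AND correct: tiny loss
print(round(log_loss(1, 0.1), 3))  # -> 2.303  confident but WRONG: huge loss
```

Wrong confident predictions get punished much harder than hesitant ones, which is exactly the behaviour we want from a loss.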
This formula looks a bit weird, and it doesn't have a closed-form solution. So, we use Gradient Descent to find the optimal minimum.
I'm going to skip over the heavy calculus of the gradient descent part (because honestly, taking the derivative of that was terrible), but the beautiful part is that it simplifies down to this:
Gradient = (1/m) * Σ (y_pred - y) * X_i   (summed over all m training points)
So our final weight update rule becomes:
W_new = W_old - learning_rate * (1/m) * Σ (y_pred - y) * X_i
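Putting the whole pipeline together, here is a minimal NumPy sketch of that update rule on a toy 1-D dataset (the data, learning rate, and iteration count are all made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: points below ~2.5 are class 0, above are class 1 (hypothetical)
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
X_b = np.hstack([X, np.ones((len(X), 1))])  # append a bias column

m = len(y)
learning_rate = 0.5
W = np.zeros(X_b.shape[1])

for _ in range(5000):
    y_pred = sigmoid(X_b @ W)                  # probabilities, not hard classes
    gradient = (1 / m) * X_b.T @ (y_pred - y)  # the simplified gradient above
    W = W - learning_rate * gradient           # the weight update rule above

print((sigmoid(X_b @ W) >= 0.5).astype(int))  # -> [0 0 0 1 1 1]
```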
And there you have it! This is the core math behind Logistic Regression in scikit-learn (though under the hood the library defaults to a fancier optimizer, lbfgs, rather than plain gradient descent).
Uff, that is too much to digest at once... Hope you enjoyed it as much as I did!