
The probability space and general theory
When probability is discussed, it's usually in terms of the probability of a certain event happening. Is it going to rain? Will the price of apples go up or down? In the context of machine learning, probabilities tell us the likelihood of events such as a comment being classified as positive rather than negative, or a fraudulent transaction occurring on a credit card. We measure probability by defining what we refer to as the probability space. A probability space is a formal description of which outcomes are possible in a given situation and how likely each of them is. Probability spaces are defined by three characteristics:
- The sample space, which tells us the possible outcomes of a situation
- A defined set of events, such as two fraudulent credit card transactions
- A probability measure, which assigns a probability to each of these events
While probability spaces are a subject worthy of studying in their own right, for our own understanding, we'll stick to this basic definition.
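To make the three parts concrete, here is a minimal sketch in Python. The fair six-sided die, the two named events, and the `probability` helper are all illustrative assumptions, not something defined elsewhere in this text; the measure simply assumes every outcome is equally likely.

```python
from fractions import Fraction

# Sample space: every possible outcome of one roll of a fair six-sided die
# (the die is an assumed example).
sample_space = {1, 2, 3, 4, 5, 6}

# Events: subsets of the sample space that we care about.
events = {
    "even": {2, 4, 6},
    "greater_than_four": {5, 6},
}

def probability(event):
    """Probability measure, assuming all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_even = probability(events["even"])
p_gt4 = probability(events["greater_than_four"])
p_all = probability(sample_space)  # the whole sample space has probability 1

print(p_even, p_gt4, p_all)
```

Using `Fraction` keeps the probabilities exact, which makes it easy to check that the measure of the whole sample space is 1.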
In probability theory, the idea of independence is essential. Two random variables are independent when knowing the value of one tells us nothing about the value of the other. This is an important assumption in deep learning, as non-independent features can intertwine and affect the predictive power of our models.
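For events, the standard way to state independence is that P(A and B) = P(A) × P(B). The sketch below checks that condition by enumerating two fair dice; the dice, the three events, and the helper names `prob` and `independent` are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice (assumed example).
outcomes = list(product(range(1, 7), repeat=2))

def prob(predicate):
    """Exact probability that an outcome satisfies the predicate."""
    favorable = sum(1 for o in outcomes if predicate(o))
    return Fraction(favorable, len(outcomes))

def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

def first_even(o):
    return o[0] % 2 == 0

def sum_is_seven(o):
    return o[0] + o[1] == 7

def sum_at_least_ten(o):
    return o[0] + o[1] >= 10

print(independent(first_even, sum_is_seven))      # first die does not affect this
print(independent(first_even, sum_at_least_ten))  # first die does affect this
```

The first pair passes the check because every value of the first die leaves exactly one way to reach a sum of seven. The second pair fails because a high sum is more likely when the first die shows 4 or 6 than when it shows 2.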
In statistical terms, a collection of data about an event is a sample, which is drawn from a theoretical superset of data called a population that represents everything that could be observed about a grouping or event. For instance, if we were to poll people on the street about whether they believe in Political View A or Political View B, we would be generating a random sample from the population, which would be the entire population of the city, state, or country where we are polling.
Now let's say we wanted to use this sample to predict the likelihood of a person having one of the two political views, but we mostly polled people who were at an event supporting Political View A. In this case, we may have a biased sample. When sampling, it is important to take a random sample to decrease bias; otherwise, any statistical analysis or modeling that we do with the sample will be biased as well.
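The effect of a biased sample can be simulated in a few lines. This is a rough sketch under made-up assumptions: a population of 100,000 people of whom 40% hold View A, and a "rally" poll in which View A supporters are five times as likely to be picked. None of these numbers come from real data.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 40% hold View A, 60% hold View B.
population = ["A"] * 40_000 + ["B"] * 60_000
true_share = population.count("A") / len(population)

def share_of_a(sample):
    """Fraction of the sample that holds View A."""
    return sample.count("A") / len(sample)

# Random sample: every person is equally likely to be polled.
random_sample = rng.sample(population, 1_000)

# Biased sample: polling at a View A event, modeled here by making
# View A supporters 5x as likely to be picked (drawn with replacement).
weights = [5 if view == "A" else 1 for view in population]
biased_sample = rng.choices(population, weights=weights, k=1_000)

random_estimate = share_of_a(random_sample)
biased_estimate = share_of_a(biased_sample)

print(f"true share of A:  {true_share:.2f}")
print(f"random estimate:  {random_estimate:.2f}")
print(f"biased estimate:  {biased_estimate:.2f}")
```

The random sample's estimate should land close to the true share, while the biased sample overstates support for View A by a wide margin. Any model trained on the biased sample would inherit that distortion.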