What is Hypothesis Testing?
Hypothesis testing is the process of weighing a claim against its opposite to decide which one the data support. If we want to test whether two variables are related, the alternative hypothesis states that there is a statistically significant relationship between them, and the null hypothesis states the opposite: that there is no relationship.
Note that we don’t test two hypotheses at once. If our alternative hypothesis is that there is a positive correlation between two variables, our null hypothesis is not that there is a negative correlation: it’s that there is not a positive correlation (so, there is either a negative or no significant correlation). The null hypothesis is always the logical negation of the alternative hypothesis: exactly one of the two must be true.
In any hypothesis test, there are two possible results: we accept the null hypothesis (e.g., if there is no significant difference between our variables), or we reject the null and accept the alternative (e.g., if there is a statistically significant correlation).
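To make the decision rule concrete, here is a minimal sketch in Python (the data, the variable names, and the 0.05 significance level are all illustrative assumptions) that tests for a correlation and accepts or rejects the null based on the p-value:

```python
# A minimal sketch of the accept/reject decision, using SciPy's Pearson
# correlation test on made-up data. The 0.05 significance level is an
# illustrative convention, not a fixed rule.
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

r, p_value = stats.pearsonr(x, y)  # null: no linear correlation

alpha = 0.05
if p_value < alpha:
    print(f"Reject the null: significant correlation (r={r:.2f}, p={p_value:.4f})")
else:
    print(f"Accept the null: no significant correlation (p={p_value:.4f})")
```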
The ideal statistical test always accepts a true null hypothesis and rejects a false null hypothesis, but every real test has some probability of error, controlled by the significance level you choose. As such, there are really four possible outcomes: we accept the null correctly, we accept it incorrectly, we reject the null correctly, or we reject it incorrectly.
A type I error is the incorrect rejection of the null hypothesis; a type II error is the incorrect acceptance of it.
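All four outcomes are easy to watch in simulation. The sketch below (assuming Python with NumPy and SciPy; the sample size, effect size, and significance level are arbitrary choices) repeatedly tests one null that is true and one that is false, and counts how often each is called wrong:

```python
# Estimating type I and type II error rates by simulation with a
# two-sample t-test. Sample size, effect size, and alpha are chosen
# purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000

type_1 = type_2 = 0
for _ in range(trials):
    # True null: both samples come from the same distribution.
    a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        type_1 += 1  # rejected a true null

    # False null: the second sample has a shifted mean.
    c, d = rng.normal(0, 1, n), rng.normal(0.5, 1, n)
    if stats.ttest_ind(c, d).pvalue >= alpha:
        type_2 += 1  # accepted a false null

print(f"Type I rate:  {type_1 / trials:.3f} (close to alpha by design)")
print(f"Type II rate: {type_2 / trials:.3f} (depends on effect size and n)")
```

The type I rate lands near alpha because that is exactly what the significance level sets; the type II rate is whatever the sample size and effect size leave you with.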
When a Type I Error Is Acceptable
Completely eliminating type I and type II errors is a statistical impossibility, but the design of a test determines the balance between them. Depending on your situation, a type I error might be an acceptable trade-off.
In medical screenings, for example, a simple and inexpensive test is administered to a large number of people, none of whom present any symptoms. These screenings are designed to bring to the doctors’ attention anyone who has any significant chance of having the condition, which means the rate of false negatives (type II errors) must be minimized. As a result, a large number of type I errors are made. The false positives from medical screenings are typically sorted out afterward by more complex and expensive tests.
Airport security in the United States is another example of a situation where type II errors are minimized by producing a large number of type I errors. The overwhelming majority of times that a metal detector goes off, it’s because of something inconsequential: a watch, a shirt made of a peculiar fabric, a metal lunchbox, a belt buckle. But because airport security is tuned to minimize false negatives, it creates a lot of false positives.
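One way to picture both of these situations is as a score with a decision threshold. In the toy simulation below (the score distributions and the 99% catch-rate target are invented for illustration, not real clinical or security figures), the threshold is lowered until nearly every true case is flagged, and the false alarms that result are counted:

```python
# Toy screening model: people without and with the condition produce
# overlapping test scores, and anyone above the threshold is flagged.
# The normal distributions and the 99% sensitivity target are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
healthy = rng.normal(0.0, 1.0, 100_000)  # scores without the condition
sick = rng.normal(2.0, 1.0, 1_000)       # scores with the condition

# Lower the threshold until 99% of true cases are caught
# (few type II errors).
threshold = np.quantile(sick, 0.01)

miss_rate = np.mean(sick <= threshold)           # type II error rate
false_alarm_rate = np.mean(healthy > threshold)  # type I error rate
print(f"threshold={threshold:.2f}  "
      f"missed cases={miss_rate:.3f}  false alarms={false_alarm_rate:.3f}")
```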
When a Type I Error Is Unacceptable
In any situation where type II errors must be minimized, an abundance of type I errors is typically the by-product. Conversely, if type I errors are unacceptable, then type II errors typically take their place.
In email spam detection systems, for example, type I errors (pushing non-spam email to the spam folder) are much less acceptable than type II errors (mistakenly leaving spam alone in the inbox). The former causes significant annoyance to the user, who may miss important emails; the latter causes minor inconvenience, because the user must manually delete the spam. An optimized spam filter will catch as much spam as possible while keeping the number of legitimate emails marked as spam as close to zero as possible.
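The same threshold logic runs in the opposite direction for a spam filter. In the toy sketch below (again with invented score distributions), the threshold is raised until almost no legitimate mail is flagged, and the spam that slips through is the accepted cost:

```python
# Toy spam filter: each message gets a spam score, and messages above
# the threshold go to the spam folder. The distributions and the 0.1%
# false positive target are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
ham = rng.normal(0.0, 1.0, 50_000)   # scores for legitimate mail
spam = rng.normal(2.5, 1.0, 50_000)  # scores for spam

# Raise the threshold until only 0.1% of legitimate mail is flagged
# (very few type I errors).
threshold = np.quantile(ham, 0.999)

ham_misfiled = np.mean(ham > threshold)   # type I: legit mail marked as spam
spam_missed = np.mean(spam <= threshold)  # type II: spam left in the inbox
print(f"threshold={threshold:.2f}  "
      f"ham misfiled={ham_misfiled:.4f}  spam missed={spam_missed:.3f}")
```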
In general, which type of error is more acceptable to you determines how you set up your test. To some extent, both errors can be reduced at once, but it’s impossible to completely eliminate either one, and driving one down tends to push the rate of the other up. Building in a system to handle the errors your test does make (such as the follow-up tests administered to people who screen positive, routine pat-downs and bag searches in airport security, or a “mark this as spam” button) will improve your overall user experience.