It's a well-accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn. Simpson's paradox, however, slams a hammer down on the rule, and the result is a good deal worse than a sore thumb. It demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes the conclusions from the large data set are exactly the opposite of the conclusions from the smaller sets. Unfortunately, the conclusions from the large set are also usually wrong.

To understand this effect we'll use a set of simulated data. Table 1 shows the average physics grades for students in an engineering program. This is a difficult class used for weeding out weaker students. Most of these students prepared for college by taking high school (HS) physics. The data show a 10 point advantage for those with HS physics. Table 2 shows the average physics grades for students in a liberal arts program. This class is designed as an elective course for the enrichment of students who would otherwise avoid physics. Few students have prepared for this class by taking HS physics. However, those few who took HS physics also have a 10 point grade advantage. In both classes, taking HS physics clearly produced an advantage.

We now combine the data sets. The combined results for students who took HS physics are shown in Table 3. The average college physics grade has been determined by adding all the grade points (4475) and then dividing by the total number of students (55). Table 4 shows the same calculation for the students without HS physics (4600 grade points over 55 students). The results of Tables 3 and 4 indicate that students who took HS physics perform worse than those who didn't by 2.3 points. This is the opposite of the conclusion drawn from Tables 1 and 2.
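The arithmetic behind the combined averages can be checked with a short script. This is only a sketch: the dictionary layout and the function name `pooled_average` are illustrative choices, while the group sizes and average grades are the simulated values from Tables 1 and 2.

```python
# Group sizes and average grades from Tables 1 and 2 (simulated data).
# Keys are (program, took_hs_physics); values are (n_students, avg_grade).
groups = {
    ("engineering", True): (50, 80),
    ("engineering", False): (5, 70),
    ("liberal_arts", True): (5, 95),
    ("liberal_arts", False): (50, 85),
}


def pooled_average(took_hs_physics):
    """Total grade points divided by total students, across both programs."""
    rows = [v for (_, hs), v in groups.items() if hs == took_hs_physics]
    total_points = sum(n * avg for n, avg in rows)
    total_students = sum(n for n, _ in rows)
    return total_points / total_students


with_hs = pooled_average(True)      # 4475 grade points / 55 students
without_hs = pooled_average(False)  # 4600 grade points / 55 students

print(f"With HS physics:    {with_hs:.1f}")
print(f"Without HS physics: {without_hs:.1f}")
print(f"Difference:         {with_hs - without_hs:+.1f}")
```

The difference comes out negative even though HS physics is worth 10 points within each program.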

Obviously, combining the data sets gives a misleading picture, but why? The answer lies in two parts. First, the data sets for the two major groups (engineering and liberal arts students) were influenced by a lurking variable: course difficulty. The engineering students received a rigorous course. The liberal arts students received a less demanding enrichment course. Second, the groups in the data sets were not the same size. Among students who had taken HS physics, engineering students outnumbered liberal arts students 50 to 5, so the combined average was weighted toward the engineering grades. Since the engineering students' course was more rigorous, this lowered the average. The opposite was true for the combined results of those who didn't take HS physics: that average was weighted toward the higher liberal arts grades.
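One way to see the weighting is to write each combined average as a weighted mean of the two program averages. The sketch below does this for the simulated data; the helper name `pooled` is a hypothetical choice made for illustration.

```python
def pooled(n_eng, avg_eng, n_lib, avg_lib):
    """Return the engineering weight and the size-weighted combined average."""
    w_eng = n_eng / (n_eng + n_lib)
    return w_eng, w_eng * avg_eng + (1 - w_eng) * avg_lib


# Students with HS physics: 50 engineering (avg 80), 5 liberal arts (avg 95).
w_hs, avg_hs = pooled(50, 80, 5, 95)
# Students without HS physics: 5 engineering (avg 70), 50 liberal arts (avg 85).
w_none, avg_none = pooled(5, 70, 50, 85)

print(f"HS physics:    {w_hs:.0%} engineering, combined average {avg_hs:.1f}")
print(f"No HS physics: {w_none:.0%} engineering, combined average {avg_none:.1f}")
```

About 91% of the HS physics group sat the harder engineering course, against about 9% of the group without HS physics, so the two combined averages are not comparing like with like.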

            HS Physics   None   Improvement
Students    50           5      ---
Ave Grade   80           70     10

Table 1. Average college physics grades for students in an engineering program.

            HS Physics   None   Improvement
Students    5            50     ---
Ave Grade   95           85     10

Table 2. Average college physics grades for students in a liberal arts program.

              # Students   Ave Grade   Grade Pts
Engineering   50           80          4000
Lib Arts      5            95          475
Total         55           ---         4475
Average       ---          81.4        ---

Table 3. Average college physics grades for students who took high school physics.

              # Students   Ave Grade   Grade Pts
Engineering   5            70          350
Lib Arts      50           85          4250
Total         55           ---         4600
Average       ---          83.6        ---

Table 4. Average college physics grades for students who didn't take high school physics.

There were four separate groups in the study as follows:

1. Engineering students with HS physics
2. Engineering students without HS physics
3. Liberal arts students with HS physics
4. Liberal arts students without HS physics

If all four groups had been the same size, the results would have indicated that students with HS physics had a 10 point advantage in their college physics grades regardless of the type of college physics they took. Likewise, if an average had been calculated that was not weighted by group size, the results would have indicated the same 10 point advantage.
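Both claims can be checked numerically. The sketch below reuses the simulated averages from Tables 1 and 2; the equal group size of 30 is an arbitrary, hypothetical value, and the names `pooled_mean`, `equal` and `actual` are illustrative.

```python
# Average grades from Tables 1 and 2 (simulated data).
averages = {
    "engineering": {"hs": 80, "none": 70},
    "liberal_arts": {"hs": 95, "none": 85},
}


def pooled_mean(column, sizes):
    """Size-weighted average grade for one column ('hs' or 'none')."""
    total = sum(sizes[p][column] for p in averages)
    points = sum(sizes[p][column] * averages[p][column] for p in averages)
    return points / total


# Actual group sizes from the tables.
actual = {
    "engineering": {"hs": 50, "none": 5},
    "liberal_arts": {"hs": 5, "none": 50},
}
# Hypothetical case: every group has 30 students (any equal size works).
equal = {p: {"hs": 30, "none": 30} for p in averages}

adv_actual = pooled_mean("hs", actual) - pooled_mean("none", actual)
adv_equal = pooled_mean("hs", equal) - pooled_mean("none", equal)

# Unweighted alternative: average the program means, ignoring group size.
n_programs = len(averages)
adv_unweighted = (
    sum(a["hs"] for a in averages.values()) / n_programs
    - sum(a["none"] for a in averages.values()) / n_programs
)

print(f"Actual sizes:  {adv_actual:+.1f}")
print(f"Equal sizes:   {adv_equal:+.1f}")
print(f"Unweighted:    {adv_unweighted:+.1f}")
```

Only the calculation that uses the actual, unequal group sizes reverses the sign of the advantage.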

Conclusions

Simpson's Paradox is caused by a combination of a lurking variable and data from groups of unequal size being combined into a single data set. The unequal group sizes, in the presence of a lurking variable, can weight the results incorrectly. This can lead to seriously flawed conclusions. The obvious way to prevent it is to not combine data sets of different sizes from diverse sources.

Simpson's Paradox will generally not be a problem in a well-designed experiment or survey if possible lurking variables are identified ahead of time and properly controlled. Controlling them means eliminating them, holding them constant for all groups, or making them part of the study.