
All researchers have had, or will have, trouble with outliers in their data. Outliers cause a number of problems, one of which is that they can drastically affect the value of the mean, which in turn causes problems for statistical analysis. The difficulty with outliers is that it is not always obvious why one has arisen. Sometimes an outlier is caused by genuine variation, such as higher intelligence on an IQ test or slow reaction times because someone is tired. However, sometimes a participant deliberately confounds the results of an experiment, or does so by not paying full attention or simply not caring. Researchers may also enter the data incorrectly, or the measuring equipment may be slightly faulty. When such values are obvious within a data set, it is often justifiable to remove them.

This website, http://www.graphpad.com/library/BiostatsSpecial/article_38.htm, recommends Grubbs’ method for detecting outliers, which lets you detect when a value is unlikely to have come from the population sampled. The author recommends removing these outliers so as not to confound the data, but also documenting your decision. The author does, however, mention that some feel you should never remove an outlier unless you notice an experimental problem.
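For readers curious what that test actually involves, here is a minimal sketch of a two-sided Grubbs’ test in Python. The function name, the 0.05 significance level and the example reaction times are my own illustrative choices rather than anything from the article; the calculation simply follows the standard formula, comparing the largest deviation from the mean (in SD units) against a critical value from the t distribution.

```python
# Illustrative two-sided Grubbs' test (standard textbook formula).
# The function name, alpha level and example data are made up for this sketch.
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Return the most extreme value, the Grubbs statistic, the critical value,
    and whether the test flags that value as an outlier."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)           # sample mean and SD
    deviations = np.abs(x - mean)
    suspect = x[deviations.argmax()]             # value furthest from the mean
    g = deviations.max() / sd                    # Grubbs statistic
    # Critical value from the t distribution (two-sided test, n - 2 df)
    t_crit = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return suspect, g, g_crit, g > g_crit

# Hypothetical reaction times (ms) with one suspiciously slow response
rts = [312, 298, 305, 321, 330, 295, 310, 980]
value, g, g_crit, is_outlier = grubbs_test(rts)
print(f"suspect value {value}: G = {g:.2f}, critical = {g_crit:.2f}, outlier = {is_outlier}")
```

With these made-up reaction times, the 980 ms response is flagged because its Grubbs statistic exceeds the critical value; the article’s advice still applies, though, in that any removal should be documented.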

If an outlier is caused by a participant not concentrating on a task at all (such as a reaction time task where the participant repeatedly presses the response key without looking at the screen because they are bored), their result is not valid. If their result is not valid and is extreme enough to greatly affect the mean, then in my opinion it should definitely be removed from the data. Outliers can cause both Type I and Type II errors, and this should be considered if an outlier lies a great number of standard deviations away from the mean.

It is important that experimenters run pilot trials before gathering their actual data, as a poorly implemented design can also cause outliers. It may lead to participants not understanding the task, or to the ability of the participants being over- or underestimated from the very beginning. For example, administering an easy IQ test to university students may produce very high scores for all participants, confounding the results.

Research has suggested that researchers disagree on the appropriateness of deleting data points from a study (Sackett and Dubois, 1991). If this is so, it stands to reason that the results reported in an article might have been different had a different researcher analysed them. We should therefore consider drawing our own conclusions about the data rather than trusting everything we read, especially since outliers are not always reported in research, which has happened in the past and can completely transform the data.

Overall, unless it is extremely obvious that an outlier was caused by an experimental fault or an uninterested participant, it may not be justifiable to remove it from the data. After all, we research our hypotheses to answer questions; we shouldn’t be manipulating data so that it agrees with our own expectations. However, in some cases removing outliers is unavoidable, and as long as it is reported and justified, I think it is the right thing to do to obtain more valid results.

Comments on: "Is it dishonest to remove outliers and/or transform data?" (4)

  1. As stats are for summarising data, outliers can be an indication of ‘another population’~ so yes, when removing I always document why. This could be an interesting difference. Removal doesn’t need to mean ignored, just removed from a group of those ‘more like’~ especially if a value is more than 2 SDs from the mean.

    It’s a tragedy tho, when an outlier is seen as ‘bad’, is removed and ignored~ instead of the researcher asking~ gee, I wonder what makes that person so different from the others; e.g., I wonder why they were really slow on their RT…leads to some interesting research directions.

    Yesterday I edited a thesis where students in the same year level on one campus had responses significantly different to those on another. So one campus was cut from the sample! Their data completely ignored. Why?

    And the scores on empathy ratings amongst undergrad psych students~ that’s interesting! IMO, why does one campus of psych students have empathy ratings significantly different to the other, if they have the same curriculum?

    Could it be the social culture modeled to them by staff? The community values where the campus is embedded? Is there a high rate of Buddhist students ~:-) in the sample, or vice versa~ a high proportion of profit-hungering, clinically orientated apprentices?

    Thus~ not always dishonest, but more than likely to the detriment of social science and the wellbeing of humans~ to ignore the outlier, even if cut from the sample for particular analyses.

  2. This is, I believe, a very difficult question to answer, and it can become quite murky and cloudy. I believe the answer lies in establishing why the outliers are there. They cause a number of problems for research, such as affecting validity and reliability. Researchers need to use their instinct, experience and judgment when dealing with outliers, and it often requires an experienced eye.

    Outliers can occur through chance, human error, intentional or motivated misreporting by participants, sampling error and so on. Whether researchers decide to leave outliers in or remove them, they need to ensure that the findings remain reliable and valid; this is the core of any research, so if keeping or removing an outlier affects validity and reliability, the decision needs to be questioned.

    To remove an outlier in order to manipulate the findings, so as to support a hypothesis, would obviously be a form of cheating and should not under any circumstances be done. Outliers should only be removed for the benefit of truth, or to get closer to the truth. I don’t believe it is advisable to ignore outliers, as there is usually, as mentioned above, a reason for their occurrence that is not genuine and real to the research; they require special attention and should always be looked for. To identify them, the usual simple rule of thumb is that if a data point is 3 standard deviations from the mean it is an outlier, and careful consideration should be given to it and to what to do with it (the sketch after the comments shows this kind of check).

  3. I really liked your effective use of the website. Do you think that, in the long run, it would be easier for all researchers if they just used the same rules for removing outliers? Would this eliminate the dishonesty of removing the data? Generally I think it may open another can of worms, because if researchers knew that they only had to give a reason for removing a piece of data, they could get around the fact that they were cherry-picking their data to comply with their hypothesis. Your point on removing outliers when the participant was not concentrating on the task was very good; however, how do you actually know that they weren’t concentrating? Also, if you get a normal-looking set of results which seems valid, but the participant got those results by chance and was not concentrating, is it still honest to keep that data in? I thought your blog was really interesting to read. Well done!

  4. Outliers can occur by chance and can be down to natural variation in populations. According to the 68-95-99.7 rule (also called the three-sigma or empirical rule), about 1 in 3 observations will differ from the mean by more than one standard deviation, and about 1 in 22 by more than two standard deviations (the sketch after the comments works these proportions out). Outliers are therefore to be expected in any large data set, so they should not be removed automatically. The researcher’s judgement on the removal of an outlier is paramount. Not everyone will agree with removing an outlier, but as long as the explanation is justified and the data has not been manipulated just to show what the researcher wants it to show, there should be no reason to question the removal of outliers.

    http://en.wikipedia.org/wiki/68-95-99.7_rule
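As a small footnote to comments 2 and 4: a minimal Python sketch of the two rules of thumb mentioned above, assuming scipy is available. It works out the proportion of a normal population expected to fall more than 1, 2 or 3 SDs from the mean, and flags values more than k sample SDs from the sample mean. The flag_outliers helper and the example numbers are invented purely for illustration.

```python
# Illustrative only: expected tail proportions under the 68-95-99.7 rule,
# and a simple z-score filter like the 2 SD / 3 SD rules of thumb above.
import numpy as np
from scipy import stats

# Proportion of a normal population expected more than k SDs from the mean
for k in (1, 2, 3):
    p = 2 * stats.norm.sf(k)                     # two-tailed tail area
    print(f"beyond {k} SD: {p:.2%} (about 1 in {1 / p:.0f})")
# prints roughly: 31.73% (1 in 3), 4.55% (1 in 22), 0.27% (1 in 370)

def flag_outliers(values, k=3):
    """Return the values lying more than k sample SDs from the sample mean."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) > k]

# Made-up scores with one extreme value; k=2 flags the 160
print(flag_outliers([98, 102, 101, 99, 100, 103, 97, 160], k=2))
```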
