As a former experimental physicist who did funded research for several years, published many papers in physics journals, and was on the editorial board of a physics journal for three years, I have several comments about the quality of medical publications and the interpretation of their results by the lay press.
First, it seems to be a common habit to publish data points in a graph without error bars. This makes it impossible to interpret the results properly. (For lay persons, the error bars show the range of +/- 2 standard deviations, which means that IF the data and error distributions are Gaussian, then there is roughly a 2.5% probability that a repeat measurement would fall above the range of the error bars, and roughly a 2.5% probability that it would fall below.) This mistake is then often compounded by connecting the data points with a series of straight line segments, rather than with a smooth curve (drawn with a French curve) or a least squares fit.
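For readers who want to check the tail probabilities, here is a minimal Python sketch. It assumes a perfectly Gaussian distribution, and the function name is my own. Strictly speaking, +/- 2 standard deviations leaves about 2.3% in each tail; the round figure of 2.5% corresponds to +/- 1.96 standard deviations.

```python
import math

def two_sided_tail(k: float) -> float:
    """Probability that a Gaussian value lies more than k standard
    deviations from the mean, counting both tails together."""
    return math.erfc(k / math.sqrt(2))

for k in (1.0, 1.96, 2.0):
    outside = two_sided_tail(k)
    print(f"+/- {k} SD: {outside:.4f} outside the bars, "
          f"{outside / 2:.4f} in each tail")
```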
When a result is presented as being "statistically significant", what is usually meant is that there is less than a 5% probability that a difference this large would arise by chance alone (yes, I know I am simplifying here). However, a statistically significant result may not be clinically significant. It is easy to demonstrate that if you have enough subjects, even a trivially small difference will be statistically significant. But is it really medically useful to know how to decrease your risk of being killed by a falling meteorite by 50%?
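The effect of sample size can be illustrated with a short Python sketch. The event rates (10.0% versus 9.8%) and group sizes are invented for illustration, the observed rates are assumed to equal those values exactly, and the test used is an ordinary two-proportion z-test with equal group sizes.

```python
import math

def two_proportion_p_value(p1: float, p2: float, n: int) -> float:
    """Two-sided p-value of a z-test comparing two observed proportions,
    with n subjects in each group (normal approximation)."""
    pooled = (p1 + p2) / 2                      # equal group sizes assumed
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical event rates: the same tiny difference at every sample size.
for n in (1_000, 100_000, 1_000_000):
    p = two_proportion_p_value(0.100, 0.098, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n:>9,} per group: p = {p:.2g} ({verdict})")
```

The difference between the groups never changes; only the number of subjects does. Whether a 0.2 percentage point difference matters to a patient is a clinical question that the p-value cannot answer.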
The lay press is also woefully ignorant of the concept of statistical variation. If you tabulate, for instance, cases of breast cancer in every county in a state, one county has to be the highest, and one has to be the lowest, without any "cause". Every time there is a clustering of cases (as in lymphoma in Passaic, N.J. about 15 years ago), there is a rush to find the cause.
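A small simulation makes the point. In this Python sketch every county has exactly the same true risk, yet the counts still spread out. The number of counties, the population, the rate, and the random seed are all arbitrary assumptions chosen for illustration.

```python
import random

def simulate_counties(n_counties: int = 50, population: int = 20_000,
                      true_rate: float = 0.005, seed: int = 1) -> list[int]:
    """Number of cases in each county when every resident of every county
    has the same independent probability of disease."""
    rng = random.Random(seed)
    return [sum(rng.random() < true_rate for _ in range(population))
            for _ in range(n_counties)]

counts = simulate_counties()
expected = 20_000 * 0.005
print(f"expected cases per county: {expected:.0f}")
print(f"lowest county: {min(counts)}, highest county: {max(counts)}")
```

The highest county will typically show noticeably more cases than the lowest, even though no cause distinguishes them in this model.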
Then the relative risk rather than the absolute risk is emphasized. Again, if your chance of being killed by a falling meteorite is one in a million (0.0001%), then by decreasing your relative risk by 50% I have lowered your absolute risk by only 0.00005 percentage points, from 0.0001% to 0.00005%.
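The meteorite arithmetic can be worked through in a few lines of Python. This is only a sketch, the function name is my own, and the one-in-a-million baseline is the illustrative figure used above, not a measured risk. Halving a one-in-a-million risk lowers the absolute risk by 0.00005 percentage points.

```python
def absolute_risk_reduction(baseline_risk: float,
                            relative_risk_reduction: float) -> float:
    """Absolute drop in risk produced by a given relative risk reduction."""
    return baseline_risk * relative_risk_reduction

baseline = 1 / 1_000_000            # one in a million
arr = absolute_risk_reduction(baseline, 0.50)

print(f"baseline risk:      {baseline:.5%}")
print(f"risk after halving: {baseline - arr:.5%}")
print(f"absolute reduction: {arr:.5%}")
print(f"people who must be protected to spare one: {1 / arr:,.0f}")
```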
Finally, and most egregious of all, there is the use of surrogate endpoints. For instance, in the study of the effect of lowering cholesterol by the use of Zetia, instead of looking at heart attacks or strokes as the primary endpoint, the thickness of the intima of the carotid artery was used as a surrogate endpoint. If the most common cause of arterial blockage is rupture of a plaque rather than an embolus, then this surrogate endpoint is not medically useful. (Personally, I think the evidence points to the anti-inflammatory effects of ASA and statins as reducing the risk of plaque rupture and acute blockage, but no one has yet been able to detect such an acute event when it happens in humans.)
Actually, in fairness to medical researchers, I should mention the main limitation that they face. In physics, if an experimental result is published, other researchers rush to repeat it by a different experimental technique, using different apparatus, to help establish the validity and general applicability of the result. Thus, after Wu, Ambler, Hayward, Hoppes and Hudson verified the theoretically predicted non-conservation of parity in beta decay using electrons, Garwin, Lederman and Weinrich verified it by studying the decay of muons. There have been at least ten different experimental tests of Bell's inequality, using different setups and techniques, and the violations they found support the "spooky" action-at-a-distance required by quantum measurement theory. As soon as Mössbauer announced his effect, physicists around the world rushed to duplicate it. The speed of light has been measured many, many times, as has the dilation of time predicted by the theory of special relativity. On the other hand, in medical research, there is only one direct way to do the experiment, since no other "experimental equipment" exists. This has at least two consequences: (1) there is less glory in verifying a medical result, even if it is done to a higher level of confidence by studying more people, and (2) if a result is "very convincing", then virtually no one will repeat it, because doing so would seem a waste of money to the funding office, or a malpractice risk to the research group.
I should also mention that some experiments are not done because they are deemed not to be in the public interest. There have been several studies in Europe (usually published in Lancet) that seem to indicate that cigarette smokers have a lower incidence of Parkinson's Disease. This suggests a relationship between nicotinic receptors in the brain and dopaminergic neurons. But I predict that no one in the USA would receive federal funding to do a prospective study to see if, in fact, cigarette smoking does protect against Parkinson's Disease, or, indeed, has any other benefit.
All studies of new drugs, and many studies of existing drugs, are done on pharmacologically naive patients, who are on no drugs at the time of the experiments. Since most of my patients are on at least four drugs, the results of the study may not apply to them, as regards both benefits and side effects. (This is typified by the fact that if I have a patient with diabetes, hypertension, osteoarthritis, chronic hepatitis, and GERD, and try to follow all five of the government guidelines, drugs used to treat one problem conflict with the guidelines for another problem.)
Now that I have expressed several of my opinions, let me complete this article by reviewing definitions of several common words and phrases that you may see in research articles:
NNT---Number Needed to Treat---the statistically suggested number of patients to treat with the studied drug in order to achieve the expected outcome in one.
NNH---Number Needed to Harm---similar in concept to NNT, except it is the number to treat to get a bad outcome in one. If NNT is greater than NNH, you have a problem, unless, perhaps, the outcome in NNT is preventing certain death.
NNS---Number Needed to Sue---this is not generally listed in statistical textbooks. It is the number of patients out of a million who get a bad enough result that a malpractice lawyer thinks it worthwhile to start a class action suit.
Statistically significant---if the treatment really had no effect, there would be less than a 5% probability (one-in-twenty) of seeing a result at least this large by chance alone. Put another way, if you ran 20 experiments on treatments that do nothing, you would expect about one of them to come out "significant" by chance.
Type I/Alpha Error---you conclude that there is a statistically significant difference between the control group and the treated group, when there is really NO difference. Similar in concept to a false positive conclusion.
Type II/Beta Error---you conclude that there is no statistically significant difference between the control and the treated group, when there really is. Similar in concept to a false negative conclusion.
Sensitivity---the probability that if you have the disease, you test positive; i.e. a sensitive test has a low false negative rate.
Specificity---the probability that if you do not have the disease, you test negative; i.e. a specific test has a low false positive rate.
Common clinical sense---tells you not to believe a positive test result in a particular patient. If you order a panel of 20 independent tests, each with a "normal" range set to include 95% of healthy people (about +/- 2 standard deviations for a Gaussian distribution), then there is about a 64% chance (1 - 0.95^20) that at least one of the tests will fall outside the "normal" range without indicating true disease.
Correlation/Causation---two events can be related in time or space without having a cause-and-effect relationship. Propinquity can always be coincidental, but can also suggest paths for future research.
Confounding---a factor, not considered when looking for a cause-and-effect relationship, that influences the outcome. The best example would be the initial statistical demonstration that coffee drinkers had a higher rate of heart attacks, without allowing for the confounding effect that more coffee drinkers than non-drinkers smoked cigarettes. It is probably impossible to ensure the absence of all confounding effects, since we don't know about many of them, and it is virtually impossible to test for their existence.
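Confidence Interval---Similar in concept to error bars around a measured data point. The 95% confidence interval of a result is a range, calculated from the data, that is expected to contain the true value of the quantity being estimated; if the study were repeated many times, about 95% of the intervals so calculated would contain it. It is not the range in which 95% of the studied population falls.
Intention-to-treat---an analysis that includes all patients who were randomized in the drug study, each counted in the group to which he or she was assigned, whether or not they dropped out of the study.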
Incidence---the percentage of the population that develops a given disease in a given period of time.
Prevalence---the percentage of the population that has the disease at a given time. Note that the prevalence of a disease helps to determine whether you want to emphasize avoiding a Type I error or avoiding a Type II error, and whether you want a test with high sensitivity or one with high specificity.
Endpoint---the result you are looking for to determine that a treatment "works".
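Several of these definitions are easier to absorb with numbers attached. The Python sketch below uses the standard textbook formulas; the function names and all input values (event rates, sensitivity, specificity, prevalence) are hypothetical and chosen only for illustration. The panel calculation assumes independent tests, each with a "normal" range covering 95% of healthy people.

```python
def number_needed_to_treat(control_event_rate: float,
                           treated_event_rate: float) -> float:
    """NNT = 1 / absolute risk reduction."""
    return 1 / (control_event_rate - treated_event_rate)

def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability that a patient with a positive test has the disease."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def chance_of_any_abnormal(n_tests: int,
                           normal_fraction: float = 0.95) -> float:
    """Probability that a healthy patient has at least one result outside
    the 'normal' range on a panel of independent tests."""
    return 1 - normal_fraction ** n_tests

# Hypothetical trial: 10% of controls and 8% of treated patients have the event.
print(f"NNT: {number_needed_to_treat(0.10, 0.08):.0f}")

# Hypothetical test: 90% sensitive, 90% specific, disease prevalence 1%.
print(f"PPV: {positive_predictive_value(0.90, 0.90, 0.01):.1%}")

# Panel of 20 tests ordered on a healthy patient.
print(f"Chance of at least one 'abnormal': {chance_of_any_abnormal(20):.1%}")
```

The second calculation shows why prevalence matters: even a test that is 90% sensitive and 90% specific yields mostly false positives when the disease is rare.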