Wednesday, 1 July 2009

I get my name in the Veterinary Record

This is somewhat old news, but I haven't had chance to write about it before. To add to the publications I have in Homeopathy, I now have one (as third author) in the Veterinary Record. This is starting to get silly; I'm supposed to be a geologist.

Perhaps unsurprisingly, this is related to a terrible homeopathy study [Hill et al., The Veterinary Record 164:364-370], this time on the treatment of skin conditions in dogs. It's another example of homeopaths continuing to do small, badly designed studies, when plenty of large and properly conducted studies, and systematic reviews and meta-analyses of those studies, show that homeopathy doesn't work. The letter I am involved in is one of three letters that were published criticising the study: they can be found, with the authors' reply, at The Veterinary Record 164: 634-636 [apologies for the lack of links: there's no DOI for these that I can find]. There is also an excellent discussion of the paper, and some of the responses to it, over at JREF.

The design in this study is truly extraordinary. Initially, 20 dogs with skin problems were recruited to the study. All were treated with individualised remedies by a homeopath. In 15 cases, the dog owners reported no improvement. In 5 cases, the owners reported a significant improvement. Not looking good for homeopathy so far. Still, the five improved dogs were said to have responded well to homeopathy, and went on to phase 2, which was a proper randomised and blinded placebo-controlled trial. Unfortunately, one dog had to be euthanased before the trial could happen, and another dog's skin problems had resolved completely after the first stage, leaving only three dogs in phase 2. Supposedly, those dogs did better with homeopathy than with placebo, thus justifying, as ever, "further research".

This is possibly the easiest study to criticise that I've ever seen. Put simply, the first phase lacks a control group, so improvements cannot be attributed to homeopathy. There is simply no evidence that the five dogs recruited to phase 2 actually responded to homeopathy, rather than just improved spontaneously. Then the second phase of the trial includes only three dogs. There is no way to interpret the results of such a tiny, underpowered study. Those are the main problems, but there are others. For example, all the dogs were on some kind of conventional medication, so that cannot be ruled out as contributing to any improvement.
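To put a rough number on how little three dogs can tell you, here is a back-of-envelope sketch. It assumes (my simplification, not a detail taken from the paper) that each dog can simply be scored as doing better on one treatment or the other, with no ties, and that under the null hypothesis either outcome is equally likely. The function name sign_test_p is mine.

```python
from math import comb

def sign_test_p(n_dogs: int, n_better: int) -> float:
    """One-sided chance of at least n_better 'wins' out of n_dogs
    if each dog is equally likely to do better on either treatment."""
    favourable = sum(comb(n_dogs, k) for k in range(n_better, n_dogs + 1))
    return favourable / 2 ** n_dogs

# Best possible outcome for homeopathy: all three dogs do better on the remedy.
best_case_p = sign_test_p(3, 3)
print(best_case_p)  # 0.125
```

So under these assumptions even a clean sweep would turn up by chance one time in eight, well short of the conventional 0.05 threshold.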

The only reasonable conclusion from the study is that there is no strong evidence that homeopathy did anything for the dogs in the trial. But the paper concludes that the improvement seen in the five dogs (which again cannot be attributed to homeopathy on the basis of this study) is enough to justify further research. No doubt the paper will also be spammed all over the internet by the likes of Dana Ullman, as proof positive that homeopathy works. Hopefully the letter I'm a co-author on, along with the two other letters critical of the study that were published, will go some way to addressing that. The signs are not good, though. The original Hill et al. paper included the statement that "Different homeopathic remedies and different dilutions of the same remedy have been distinguished from each other using Raman and infrared spectroscopy, even though all should contain nothing but water", with a reference to "Rao and others, 2007" [In fact, Rao et al. did not even claim that infrared spectroscopy showed any difference]. Regular readers will know that Rao and colleagues did nothing of the sort, and that to describe their paper as "discredited" would be something of an understatement. In the world of homeopathy, discredited papers never die. They are just recycled for use with audiences who don't know that they've been discredited. I suspect that this one will be no different.

As an aside, my favourite part of this study is that "constitutional signs" of each of the dogs, as used by the homeopath to pick a remedy, are listed [Table 2 of the paper]. For dog number 16, these are listed as:

Affectionate
Fears thunderstorms
Clairvoyant
Grief
Desires chicken; oranges aggravate


A clairvoyant dog! And this was published in a respected veterinary journal.

Wednesday, 24 June 2009

What do bibliometrics actually add to research evaluation?

Firstly, the reason that I haven't posted in an age is that I've been in Norway, interpreting seismic data for the new project I'm working on. Hopefully I can now post a bit more regularly, as I should actually be in Manchester for a few consecutive weeks, for the first time this year.

Regular readers will know that I like to whinge about the increasing use of statistical indicators (bibliometrics) to evaluate research performance. Previously in England, research performance has been evaluated by the Research Assessment Exercise, a cumbersome and involved system based around expert peer review of research. Currently, HEFCE (the body that decides how scarce research funding is allocated to English universities) is looking into replacing this with a cumbersome and involved system based around bibliometrics and "light-touch" peer review. To this end, a pilot exercise using bibliometrics and including 22 universities has been underway. An interim report on the pilot is now available.

Essentially, three approaches have been evaluated:

i) Based on institutional addresses: here papers are assigned to a university based on the addresses of the authors, as stated in the paper. This would be cheap to do, as it would need no input from the universities.

ii) Based on all papers published by authors. In this approach, all papers written by staff selected for the 2008 RAE were identified. This requires a lot of data to be collected.

iii) Based on selected papers published by authors. Again, this approach used all staff selected for the 2008 RAE, but only used the most cited papers.

For each approach, the exercise was conducted twice: once using the Web Of Science (WoS) database, and once using Scopus. The results were then compared with those from the 2008 RAE.

Well, the results are interesting, if you like this sort of thing. It is clear that the results can be very different from those provided by the RAE, whichever method was used, although the "selected papers" method tends to give the closest results. It is also notable that the two different databases give different results, sometimes radically so; Scopus seems to consistently give higher values than WoS. Workers in some fields complained that they made more use of other databases, such as the arXiv or Google Scholar (it's worth noting that the favoured databases are proprietary, while the arXiv and Google Scholar are publicly accessible).

In general, the institutions involved in the pilot preferred the "selected papers" method, but it seems that none of the methods produced particularly convincing results. According to the report (paras 66 and 67):

In many disciplines (particularly in medicine, biological and physical sciences and psychology), members reported that the ‘top 6’ model (which looked at the most highly cited papers only) generally produced reasonable results, but with a number of significant discrepancies. In other disciplines (particularly in the social sciences and mathematics) the results were less credible, and in some disciplines (such as health sciences, engineering and computer science) there was a more mixed picture. Members generally reported that the other two models (which looked at ‘all papers’) did not generally produce credible results or provide sufficient differentiation.

One of the questions here is what is meant by "reasonable" or "credible" results? The institutions involved in the pilot seem to assume that the best results are the ones that most closely match those of the RAE. I suspect this is because the large universities that currently receive the lion's share of research funding are not going to support any system that significantly changes the status quo.

The institutions involved in the pilot seem to think that bibliometrics would be most useful when used in conjunction with expert peer review. From the report:

Members discussed whether the benefits of using bibliometrics would outweigh the costs. Some found this difficult to answer given limited knowledge about the costs. Nevertheless there was broad agreement that overall the benefits would outweigh the costs – assuming a selective approach. For institutions this would involve a similar level of burden to the RAE and any additional cost of using bibliometrics would be largely absorbed by internal management within institutions. For panels, some members felt that bibliometrics might involve additional work (for example in resolving differences between panel judgements and citation scores); others felt that they could be used to increase sampling and reduce panels’ workloads.

According to the interim report, the "best" results (i.e. those most closely matching the results of the RAE) were obtained using a methodology that will have a similar administrative burden to the RAE. Even then the results had "significant discrepancies". So, if the aim of the pilot was to get similar results to the RAE with a lesser administrative burden, it seems that the pilot exercise has failed on both counts. And if bibliometrics don't seem to add much to the process, it's worth considering what they might take away. For which, see my previous post...

Tuesday, 5 May 2009

The usual excuse for not posting

Yes, I've been hanging about in Egypt again, looking at rocks for my day job. In the absence of any bad science related stuff, here are some pretty pictures.

El Tor, the town where we stayed, at sunset.


Downtown El Tor.


Fossilised burrows in Miocene syn-rift rocks. There's a lot of this in the study area, which usually means that structures that would help to understand the depositional environment are obscured.


Part of the field area. To the right are rocks of the Precambrian basement. In the foreground, a major normal fault separates those Precambrian rocks from Nubian sandstone, Eocene carbonate units, and Miocene syn-rift calc-arenites.


Wednesday, 8 April 2009

Homeopathy paper published

So, this is the moment you’ve all been waiting for. A while ago I wrote a comment on an article that was published in Homeopathy. This article, among other things, purported to show that the authors of a Lancet meta-analysis (Aijin Shang and co-workers) that had negative results for homeopathy had engaged in post-hoc hypothesising and data dredging. That was an outrageous slur on what is a perfectly reasonable paper, if you understand it properly. My comment has now been published, along with a response from the authors. If anyone needs a copy of my comment and doesn’t fancy paying for it, drop me a line and I’ll bung you a PDF. In any case, the original version appears on my blog here.

Meanwhile, the reply by original authors Rutten and Stolper is an exercise in evasion and obfuscation, and doesn’t really address most of the points that I made. This seems to be fairly typical (and to be fair isn’t only restricted to non-science like homeopathy). In their original paper, Rutten and Stolper claimed that “Cut-off values for sample size [i.e. the number of subjects in a trial, above which the trial was defined as “large”] were not mentioned or explained in Shang el al's [sic] analysis”. This is simply not true. So what do Rutten and Stolper have to say about this embarrassing error?

Wilson states that larger trials were defined by Shang as “Trials with SE [standard error] in the lowest quartile were defined as larger trials”. According to Wilson this was done to predefine 'larger trials'. We agree with Wilson that this is indeed a strange way of defining 'larger trials', but it is perfectly possible to simply define larger studies a priori according to sample size in terms like 'above median' as we suggested in our paper. Shang et al did not mention the sensitivity of the result to this choice of cut-off value: if median sample size (including 14 trials) is chosen homeopathy has the best (significantly positive) result, if 8 trials are selected homeopathy has the worst result. In the post-publication data they mentioned sample sizes but not Standard Errors. Isn't it odd that the authors did not mention the fact that homeopathy is effective based on a fully plausible definition of 'larger' trials, but stated that it is not effective based on a strange definition of 'larger', but that this was not apparent because of missing data?

So, nothing there about how they failed to properly read the paper to check what Shang et al.’s definition of larger trials was, while essentially accusing them of research misconduct. Instead, they shift the goalposts and decide that they don’t like the definition that was provided. Now, it certainly would be possible to define larger studies as being “above median” sample size. By doing this you would be including studies of smaller size than would be included using Shang’s definition. As is well understood, and as Shang et al. clearly showed, including studies with smaller sample size will give you more positive but, crucially, less reliable results. So I don’t think it was particularly odd that Shang et al. failed to abandon their definition of larger trials in favour of someone else’s definition, published three years later, that would inevitably lead to less reliable results. Rutten and Stolper state that using 8 larger, high quality trials gives the worst results for homeopathy: but to get a positive result, you would have to include at least 14 trials, as Ludtke and Rutten show in another paper in the Journal of Clinical Epidemiology. And, again, it was perfectly apparent what definition Shang et al. used to define larger trials: it is clearly stated in their paper.

OK, so why use standard error rather than simply using sample size directly, as Rutten and Stolper want to do? In meta-analyses, a commonly used tool is a funnel plot. This plots, for each study included in the analysis, standard error against odds ratio. The odds ratio is a measure of the size of the effect of the intervention being studied. If the value is 1, there is no effect. If it is less than one, there is a positive effect (the intervention outperformed placebo); if it is greater than one, there is a negative effect (placebo outperformed the intervention). The plot is typically used to identify publication bias (and other biases) in the set of trials: to simplify, if the plot is asymmetric, then biases exist. Using their funnel plot of 110 trials of homeopathy (Figure 2 in the Lancet paper), Shang et al. were able to show (to a high degree of statistical significance, p<0.0001) that trials with higher standard error show more positive results. It then makes perfect sense to screen the trials by standard error rather than sample size, because it has been demonstrated that standard error correlates with odds ratio. Of course, you could plot sample size against odds ratio, but that is not the recommended approach.
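For anyone who wants to see the funnel plot logic in numbers, here is a minimal sketch of regressing the log odds ratio on standard error. The five "trials" are entirely made up for illustration and are not data from Shang et al.; the function name weighted_fit and the inverse-variance weighting are my own assumptions, not a reproduction of the Lancet analysis.

```python
import math

def weighted_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical trials: the smaller the trial (higher standard error),
# the lower the odds ratio, i.e. the more "positive" the result.
standard_errors = [0.1, 0.2, 0.4, 0.6, 0.8]
log_odds_ratios = [-0.1 - 2.0 * se for se in standard_errors]

# Weight each trial by 1/SE^2 so that precise trials count for more.
weights = [1.0 / se ** 2 for se in standard_errors]

intercept, slope = weighted_fit(standard_errors, log_odds_ratios, weights)

# A non-zero slope is the asymmetry; the intercept is the log odds ratio
# the fit predicts for a trial with negligible standard error.
predicted_or_large_trial = math.exp(intercept)
print(slope, predicted_or_large_trial)
```

In this toy data set the slope is negative by construction, so the odds ratio drifts back towards 1 as standard error shrinks. That is the pattern that makes standard error a sensible thing to screen on.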

Rutten and Stolper also claim to be "surprised" that one apparently positive trial of homeopathy was excluded from Shang's analysis. Since it was excluded based on the clearly stated exclusion criteria, I didn't find that surprising myself. How do Rutten and Stolper respond?

"We were indeed amazed that no matching trial could be found for a homeopathic trial on chronic polyarthritis by Wiesenauer. Shang did not specify criteria for matching of trials. We would expect the authors to explain this exclusion because Wiesenauer's trial would have made a difference in meta-regression analysis and possibly also in the selection of the eight larger good quality trials".

This routine is now wearily familiar. Someone makes a claim that Shang et al. didn’t do something, in this case specify criteria for matching of trials; I check the Lancet paper, and find that claim to be false. What did Shang have to say about matching of trials? On page 727, they say “For each homoeopathy trial, we identified matching trials of conventional medicine that enrolled patients with similar disorders and assessed similar outcomes. We used computer-generated random numbers to select one from several eligible trials of conventional medicine”. And, of course, the authors did explain why the trial was excluded; it met one of the pre-defined exclusion criteria. To me, that seems clear enough. As it stands, Rutten and Stolper’s point is nothing more than an argument from incredulity. They are amazed! Amazed that no matching trial could be found. But they haven’t actually found one to prove their point. It’s possible that this Wiesenauer trial might have made a difference to the selection of 8 large, high quality trials. But I doubt it would have made any significant difference to the meta-regression analysis, which was based on 110 trials.

Having wrongly accused Shang et al. of doing a bad thing by defining sub-groups post-hoc, Rutten and Stolper applied all kinds of post-hoc rationalisations for excluding trials they don’t like. For example, they decided to throw out all the (resoundingly negative) trials of homeopathic arnica for muscle soreness in marathon runners, on the basis that homeopathy is not normally used to treat healthy people, and these trials therefore have low external validity. I argued that Shang et al. had to include those studies, since they met the inclusion criteria and did not meet the exclusion criteria. On what basis could they exclude them? From Rutten and Stolper, answer came there none:

"Wilson's remark about prominent homeopaths choosing muscle soreness as indication is not relevant. Using a marathon as starting point for a trial is understandable from a organisational point of view, although doubt is possible about external validity. Publishing negative trials in alternative medicine journals is correct behaviour. There is, however, strong evidence that homeopathic Arnica is not effective after long distance running and homeopathy as a method should not be judged by that outcome".

Yes, publish the negative trials. But why shouldn’t the negative trials be included in a meta-analysis? Because they’re negative, and that just can’t be right? I don’t see any rationale here for excluding these trials.

Rutten and Stolper also take the time-honoured approach of arguing about statistics:

“…the asymmetry of funnel-plots is not necessarily a result of bias. It can also occur when smaller studies show larger effect just because they were done in a condition with high treatment effects, and thus requiring smaller patient numbers”.

I think this is nonsense, but anyone with more statistical knowledge should feel free to correct me. If the high treatment effects are real, then the larger studies will show them as well, and there will be no asymmetry in the funnel plot. The smaller studies are always going to be less reliable than the larger ones.

Finally, Rutten and Stolper conclude that:

"The conclusion that homeopathy is a placebo effect and that conventional medicine is not was not based on a comparative analysis of carefully matched trials, as stated by the authors".

Homeopaths do want this to be true, but no matter how many times they repeat it, it continues to be false. I think the problem is that they have become fixated on the analysis of the subgroup of larger, higher quality trials, which was only one part of the analysis. The meta-regression analysis for all 110 vs 110 trials gave the same results; the analysis of the “larger, higher quality” subgroup merely lends support to those results. So after all that palaver, there’s still no reason to think that there is anything particularly wrong with the Shang et al. Lancet paper, and there is certainly no excuse for accusing its authors of research misconduct.

Friday, 20 March 2009

Bloody Elsevier

Some time ago, I had a paper on normal fault evolution in the Gulf of Suez accepted for publication in the Journal of Structural Geology. This is an Elsevier journal, and the paper duly went off to the Elsevier production people to be published. Now, one of the figures in the paper is a large and spectacularly detailed geological map of the study area, done in the late 1990s by my co-author and former University of Manchester post-doc Ian Sharp. This is an excellent piece of work in itself, and it had never been published; we decided that this paper would be a good place to finally publish it. The level of detail on this map is such that we wanted to reproduce it in colour, at A3 size. We knew that this would cost money, but the industrial sponsors of the work were happy to cover the costs.

After a long and generally fruitless attempt at corresponding with the Elsevier production department (which has been outsourced to India, incidentally), I finally received a PDF proof of the paper in which the geological map was reproduced at A3 size. All well and good. Until the final version of the paper was published [paywall: for God's sake, don't pay $31.50 for this...if you really want a copy, e-mail me and I'll send you a PDF], and the map was back to A4 size, with much of the fine detail lost as a result.

Gah!

Now, surely it isn't on for Elsevier to unilaterally make changes to an article without consulting the authors about it. I know some people who have been involved in editing this journal, and it seems they are unhappy with how it is being run by Elsevier. As Dr Aust points out, companies like Elsevier charge large amounts of money for papers, in just about the only example of publishing in which the authors don't want to be paid for producing all the content. Elsevier makes massive profits out of journal publishing, gets to hide all of the content behind ridiculous paywalls, and doesn't even make a particularly good job of the journal production. There must be a better way.

Thursday, 19 March 2009

What is the Russell Group for?

The Russell Group contains the 20 major research-intensive universities in the UK. The University of Liverpool is a member of the group, and has recently made the news by earmarking its departments of Politics and Communication, Statistics, and Philosophy for closure. The reason is that those departments are seen as having underperformed in the 2008 RAE (Research Assessment Exercise).

In the RAE, departments are ranked by the proportion of research they have in five different categories, as follows:

4*: Quality that is world-leading in terms of originality, significance and rigour.

3*: Quality that is internationally excellent in terms of originality, significance and rigour but which nonetheless falls short of the highest standards of excellence.

2*: Quality that is recognised internationally in terms of originality, significance and rigour.

1*: Quality that is recognised nationally in terms of originality, significance and rigour.

Unclassified: Quality that falls below the standard of nationally recognised work. Or work which does not meet the published definition of research for the purposes of this assessment.


The three departments faced with closure had no research ranked in category 4*. According to Times Higher Education, "The university has questioned whether this is “acceptable” for a member of the Russell Group of 20 research-led institutions".

So, how did the threatened departments do overall? Here's their breakdown from the 2008 RAE (source):


Statistics: 4*, 0%; 3*, 35%; 2*, 50%; 1*, 15%; UC, 0%.

Politics and Communication: 4*, 0%; 3*, 15%; 2*, 55%; 1*, 25%; UC, 5%.

Philosophy: 4*, 0%; 3*, 25%; 2*, 60%; 1*, 15%; UC, 0%.


These results are surely not disastrously bad. In all cases, the vast majority of research is ranked at 3* and 2* levels: that is, it is considered to be internationally excellent or internationally recognised. Is this really such a poor performance that it requires the closure of the departments?
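As a quick check on that "vast majority" claim, the sketch below adds up the 3* and 2* shares from the breakdown above. The percentages are the ones quoted in this post; the variable names are just mine.

```python
# 2008 RAE quality profiles (percentages), as quoted above.
rae_2008 = {
    "Statistics": {"4*": 0, "3*": 35, "2*": 50, "1*": 15, "UC": 0},
    "Politics and Communication": {"4*": 0, "3*": 15, "2*": 55, "1*": 25, "UC": 5},
    "Philosophy": {"4*": 0, "3*": 25, "2*": 60, "1*": 15, "UC": 0},
}

international_share = {}
for department, profile in rae_2008.items():
    # Each profile should account for all of the submitted research.
    assert sum(profile.values()) == 100
    # 3* is "internationally excellent", 2* is "recognised internationally".
    international_share[department] = profile["3*"] + profile["2*"]

for department, share in international_share.items():
    print(f"{department}: {share}% at 3* or 2*")
# Statistics: 85% at 3* or 2*
# Politics and Communication: 70% at 3* or 2*
# Philosophy: 85% at 3* or 2*
```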

The threat of closure of these departments raises the question of what a university is actually for. If it only exists to receive as much research funding as possible, then closure is a perfectly sensible action. But if you consider the university as a community of scholars, with everyone (from undergraduates to professors) learning from each other, then closing these departments is going to contribute to the narrowing of the university experience for everyone. Is that really what the University of Liverpool wants to achieve? And is that what the Russell Group is supposed to be about?

Friday, 13 March 2009

Another example of bad science with serious real world consequences

Via Respectful Insolence, I came across this story: massive research fraud has been uncovered in the field of anaesthesiology. It appears that one Scott Reuben, MD, is accused of fabricating results in at least 21 studies he conducted in the field of multi-modal analgesia; this is discussed in more detail at Science Based Medicine. The studies are now in the process of being formally retracted by the journals that published them.

It's difficult to over-emphasise the seriousness of this. Recommendations about best practice for pain management have been made on the strength of these studies. It is now not clear that those recommendations are appropriate. Until further studies are done to sort this mess out, people are going to be denied the best possible standard of care. Bad evidence has consequences.

What is particularly galling about this case is that it was not uncovered through the scientific method. Peer review didn't uncover it, and neither did a failure to independently replicate Reuben's results. In fact, it was eventually uncovered because it was noticed that Reuben did not have approval to conduct research on human subjects for two abstracts he had submitted for presentation. The scientific community has nothing to be proud of here. Fair enough, it's largely impossible for peer review to spot fraud: there has to be a degree of trust that the data presented is not simply fabricated. But fraudulent research has entered the literature, and had recommendations based on it. Make no mistake about it, this is a massive failure. It's no good saying that the scientific method ensures that such fraud will eventually be discovered: it didn't ensure it in this case, and by now the damage is done. The science based medicine community needs to urgently consider how this sort of thing can be prevented in future.