## The relationship between the AUC in ROC analysis and the Mann-Whitney U statistic

The receiver operating characteristic (ROC) curve is a very common tool in two-group supervised learning, and the area under the curve (AUC) is the statistic most often used to summarize an ROC curve.

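1. Suppose we have $N=5$ data points that fall into two groups: the first group has $n_1 = 3$ data points and the second has $n_2 = 2$.
2. For each point we know two things: a (continuous) observed value, denoted $X_i$, and its group, coded "0" (first group) or "1" (second group).

| ID | $X_i$ | Group |
|----|-------|-------|
| 1  | 0.11  | 0     |
| 2  | 0.56  | 0     |
| 3  | 1.13  | 1     |
| 4  | 2.14  | 0     |
| 5  | 2.29  | 1     |

3. As an example, suppose the two groups correspond to "Normal" and "Disease". From the five observed $X_i$ we want to derive a simple classification rule: when $X_i > c$ we classify the $i$-th data point as "Disease", and otherwise as "Normal", where $c$ is an adjustable cutoff point.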
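The cutoff rule in step 3 can be traced by hand. The following is a minimal sketch in base R, not part of the original analysis; `classify_at` is a hypothetical helper name introduced only for illustration. It computes the sensitivity and 1 - specificity of the rule "$X_i > c$" at each candidate cutoff.

```r
# Toy data from the example (0 = Normal, 1 = Disease)
observations <- c(0.11, 0.56, 1.13, 2.14, 2.29)
groups <- c(0, 0, 1, 0, 1)

# Hypothetical helper: error rates of the rule "call Disease when X > c"
classify_at <- function(c_cut, x, g) {
  predicted <- as.integer(x > c_cut)                 # 1 = called "Disease"
  tpr <- sum(predicted == 1 & g == 1) / sum(g == 1)  # sensitivity
  fpr <- sum(predicted == 1 & g == 0) / sum(g == 0)  # 1 - specificity
  c(cutoff = c_cut, fpr = fpr, tpr = tpr)
}

# Candidate cutoffs: one below every point, then each observed value
cutoffs <- c(-Inf, sort(observations))
roc_points <- t(sapply(cutoffs, classify_at, x = observations, g = groups))
print(roc_points)
```

Each row is one point on the ROC curve. Sweeping $c$ from high to low moves from (0, 0) to (1, 1), and the area under the resulting step curve is 5/6 for these data.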

In R, the following commands run the ROC analysis and compute the AUC:

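```r
library(ROCR)

# Observed values and group labels (0 = Normal, 1 = Disease)
observations <- c(0.11, 0.56, 1.13, 2.14, 2.29)
groups <- c(0, 0, 1, 0, 1)

# ROC curve: true positive rate against false positive rate
pred <- prediction(observations, groups)
perf <- performance(pred, "tpr", "fpr")
plot(perf, xlab="1-Specificity", ylab="Sensitivity")

## wilcox.test reports W = 5; R's W is the Mann-Whitney U, i.e. the
## Disease rank sum (3 + 5 = 8) minus n2*(n2+1)/2 = 3
wilcox.test(observations[groups==1], observations[groups==0])
## AUC == 5/6, i.e. U/(n1*n2)
performance(pred, "auc")@y.values[[1]]
```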
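The same numbers can be reproduced without ROCR. The sketch below, in base R only, counts the (Disease, Normal) pairs in which the Disease value is the larger one. The half-weight for ties is the usual convention and is an assumption here, since the toy data contain no ties.

```r
observations <- c(0.11, 0.56, 1.13, 2.14, 2.29)
groups <- c(0, 0, 1, 0, 1)

disease <- observations[groups == 1]
normal  <- observations[groups == 0]

# Compare every Disease value with every Normal value
pairs_gt <- outer(disease, normal, ">")
pairs_eq <- outer(disease, normal, "==")   # ties (none in this data)
U <- sum(pairs_gt) + 0.5 * sum(pairs_eq)

# AUC is U rescaled by the number of pairs
auc_by_hand <- U / (length(disease) * length(normal))

# The statistic reported by wilcox.test for comparison
W_R <- unname(wilcox.test(disease, normal)$statistic)

print(c(U = U, W_R = W_R, AUC = auc_by_hand))
```

Here 1.13 exceeds two of the three Normal values and 2.29 exceeds all three, so $U = 2 + 3 = 5$ and the AUC is $5/6$.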


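$AUC = \frac{1}{n_1 n_2} \sum_{j=1}^{n_2} \left(R_{(j)} - j\right) = \frac{W - \frac{n_2 (n_2 + 1)}{2}}{n_1 n_2},$

where $R_{(j)}$ is the rank, among all $N$ observations, of the $j$-th smallest value in the "1" (Disease) group, and $W$ is the sum of that group's ranks.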
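As a quick check on the toy data (a worked example, assuming the rank sum is taken over the "1" (Disease) group, whose values 1.13 and 2.29 have ranks 3 and 5 among all five observations):

$W = 3 + 5 = 8, \qquad U = W - \frac{n_2 (n_2 + 1)}{2} = 8 - 3 = 5, \qquad AUC = \frac{U}{n_1 n_2} = \frac{5}{3 \times 2} = \frac{5}{6}.$

Equivalently, $(3 - 1) + (5 - 2) = 5$, which matches the value reported by `wilcox.test` and the AUC returned by ROCR.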

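1. If all $N$ values of $X$ in our example are transformed to $Y=f(X)$, where $f(\cdot)$ is a continuous, strictly increasing function, then the $W$ or $U$ computed from $Y$ keeps its original value (the invariance property).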
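2. If another data set $Z$ gives a $U$ exactly equal to the value computed from $X$, then there must exist a continuous monotone transformation $g$ such that $g(X)=Z$. (This presumes that $Z$ orders the two groups' labels the same way $X$ does; equal $U$ values alone do not always pin down that ordering.)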
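The invariance property is easy to check numerically. The following is a small sketch in base R; `u_stat` is a hypothetical helper written for this illustration, and the three transformations are arbitrary choices, not ones taken from the post.

```r
observations <- c(0.11, 0.56, 1.13, 2.14, 2.29)
groups <- c(0, 0, 1, 0, 1)

# Hypothetical helper: Mann-Whitney U of the "1" (Disease) group via ranks
u_stat <- function(x, g) {
  r  <- rank(x)
  n2 <- sum(g == 1)
  sum(r[g == 1]) - n2 * (n2 + 1) / 2
}

u_original <- u_stat(observations, groups)
u_log      <- u_stat(log(observations), groups)    # strictly increasing
u_cubed    <- u_stat(observations^3 + 1, groups)   # strictly increasing
u_negated  <- u_stat(-observations, groups)        # strictly decreasing

print(c(u_original, u_log, u_cubed, u_negated))    # 5 5 5 1
```

The two increasing transformations leave the ranks, and hence $U$, untouched. Negating the data reverses the ranks and turns $U$ into $n_1 n_2 - U = 1$, which is why the direction of the monotone function matters.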

1. #1 by xiao2er on January 3, 2011 - 7:55 PM

Just to let you know that some recent change (probably today) in your feed has messed up the format (everything shows up in one chunk without paragraph or line breaks) and marked all posts as new in Google Reader. Is this something you could fix? Thanks.

• #2 by qiuxing on January 4, 2011 - 5:53 AM

Hi xiao2er, thanks for pointing out this aberration to me! I have subscribed to my own blog via Google Reader and noticed the same technical difficulty. Btw, I am a heavy Google Reader user, but until today I had not subscribed to my own blog in Google Reader :-)

The cause, I think, might be that when I migrated from LiveJournal, WordPress marked all my older posts as “Uncategorized”. This morning I took some time and pains to manually assign the right categories to all these posts, and as a result Google Reader might think every post was a new one.

As for the messed-up html formats, I have no idea. As you may have noticed, I use html formats and WordPress keywords (like the source code, inline LaTeX, etc) rather extensively. Some of these features might have confused google reader I guess.

2. #3 by xiao2er on January 4, 2011 - 7:07 PM

The new blog you posted today shows up all right in Google Reader, so there is no need to worry about the format anymore. Thanks.

3. #4 by Ling on April 28, 2011 - 2:41 PM

You mentioned that the AUC and the Mann-Whitney U statistic are basically equivalent.

Does “basically” mean only approximately? Can strict equality be proven? Thanks!

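• #5 by qiuxing on May 16, 2011 - 4:35 PM

Once the sample sizes of the two groups are given, it is a one-to-one correspondence in the strict sense. The AUC and the U statistic differ only by a multiplicative factor (AUC == U/(n1*n2)). The proof is right here in this post, haha.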