Statistical analysis of data-list with recurring x-values

I have a list of data as follows with repeated x-values:

data = {{42, 2.73}, {41, 3.5}, {41, 3.16}, {41, 2.73}, {39, 2.83}, {39, 2.66}, {47, 3.22}, {41, 2.86},
  {41, 3.38}, {42, 2.62}, {39, 2.32}, {46, 2.99}, {49, 2.3}, {39, 3.22}, {42, 1.53}, {49, 1.46},
  {49, 1.88}, {49, 1.08}, {47, 1.01}, {41, 1.17}, {40, 1.3}, {46, 2.32}, {43, 1.85}, {39, 2.63},
  {40, 2.72}, {49, 1.9}, {48, 1.76}, {40, 1.67}, {42, 2.73}, {48, 2.97}, {43, 1.81}, {41, 0.88},
  {41, 2.56}, {40, 2.4}, {39, 2.08}, {49, 1.84}, {48, 2.07}, {46, 1.84}, {45, 2.24}, {44, 1.29},
  {44, 2.05}, {42, 1.78}, {41, 1.59}, {40, 1.27}, {49, 3.21}, {47, 2.81}, {44, 3.1}, {43, 3.29},
  {39, 3.3}, {44, 1.71}, {46, 3.08}, {47, 3.7}, {46, 3.04}, {44, 3.04}, {45, 2.88}, {49, 2.69},
  {47, 1.74}, {47, 2.36}, {45, 2.94}, {44, 1.97}, {44, 2.55}, {43, 1.53}, {43, 3.91}, {42, 3.22},
  {39, 1.84}, {44, 1.13}, {46, 1.48}, {42, 2.85}, {41, 4.21}, {41, 2.4}}

a) How do I find the mean and standard error for each distinct x-value, and present
the result as a single plot covering all elements of the list above?

b) How do I determine the linear fit based on these mean values?

=================

Have a look at GatherBy.
– b.gatessucks
Apr 30 ’13 at 14:25
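
For instance, a minimal sketch of that suggestion (using the data defined above): GatherBy collects the pairs that share an x-value into sublists, which can then be summarized per group.

(* group the pairs by their first element, the x-value *)
groups = GatherBy[data, First];

(* distinct x-values and the number of points for each *)
{#[[1, 1]], Length[#]} & /@ groups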

=================

1 Answer

=================

Needs["ErrorBarPlots`"]

(* mean and standard error of the mean of the y-values for each distinct x-value *)
nd = {#[[1, 1]], Mean[#[[All, 2]]],
      StandardDeviation[#[[All, 2]]]/Sqrt[Length[#]]} & /@
    GatherBy[data, #[[1]] &];
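
As a quick check (assuming data and nd as defined above), the per-group summary can be displayed as a table of x-value, mean, and standard error:

TableForm[nd, TableHeadings -> {None, {"x", "mean", "std. error"}}]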

(* ordinary least-squares fit to the raw data *)
fit[x_] = LinearModelFit[data, {1, x}, x]["BestFit"]

3.525688 - 0.02640621028 x
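
If you also want the parameter estimates together with their standard errors, rather than just the best-fit expression, keep the fitted model object and query its standard properties (a sketch):

lm = LinearModelFit[data, {1, x}, x];
lm["ParameterTable"]   (* estimates, standard errors, t-statistics, p-values *)
lm["RSquared"]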

(* means with standard-error bars per x-value, with the fitted line overlaid *)
Show[
 ErrorListPlot[
  {nd[[All, {1, 2}]], ErrorBar /@ nd[[All, 3]]}\[Transpose],
  PlotRange -> {0, 4},
  Frame -> True],
 Plot[fit[x], {x, 39, 50}]
]

Or perhaps this:

(* box-and-whisker summary of the y-values for each distinct x-value *)
gb = SortBy[GatherBy[data, #[[1]] &], #[[1, 1]] &];
BoxWhiskerChart[#[[All, 2]] & /@ gb, ChartLabels -> (#[[1, 1]] & /@ gb)]

There are multiple problems with this fitting approach. First, the weights should be inversely proportional to the variances of the means rather than of the data themselves. (You need to divide the variances by the counts.) This means the error bars at the beginning are incorrect, too. Second, the variances need to be computed as population variances rather than the sample estimates: otherwise, you will be in real trouble if an x-value is not repeated! Third, why bother when the result ought to be the same as if the fit were made with the original data?
– whuber
Apr 30 ’13 at 17:58
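
To illustrate the third point, here is a sketch (my own, under the usual assumption of a common error variance): a weighted fit of the group means, with each mean weighted by its group size, reproduces the ordinary fit on the raw data.

grouped = GatherBy[data, First];
means = {#[[1, 1]], Mean[#[[All, 2]]]} & /@ grouped;
counts = Length /@ grouped;

(* weighted least squares on the means, weights proportional to group size *)
LinearModelFit[means, {1, x}, x, Weights -> counts]["BestFit"]

(* ordinary least squares on the raw data, for comparison *)
LinearModelFit[data, {1, x}, x]["BestFit"]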

@whuber Agreed. Will update ASAP.
– Sjoerd C. de Vries
Apr 30 ’13 at 20:16

@whuber Updated. I kept the sample estimate for the moment (your second point). Wouldn’t using the population estimate yield a biased result?
– Sjoerd C. de Vries
Apr 30 ’13 at 21:13

I don’t think it will be biased; I believe you’re thinking of something akin to ANOVA with the x-values as categories. In your answer you are treating those x-values as numbers. Intuitively, if you tried to apply the Bessel correction to your variance estimates, you couldn’t even get a variance for any x-value that isn’t replicated. Obviously that’s not right: for fitting the straight line it’s possible to include single values along with averages of two or more. (It’s hard to estimate the variances of non-replicated values, though!) (+1 for the edits, BTW.)
– whuber
Apr 30 ’13 at 21:59
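
A small illustration of the replication issue (my own sketch, not part of the original discussion): the Bessel-corrected estimator divides by n - 1 and is therefore undefined for a single observation, whereas the population (maximum-likelihood) estimator divides by n and simply returns 0 there.

Variance[{2.73, 3.5, 3.16}]         (* sample variance: divides by n - 1 *)
CentralMoment[{2.73, 3.5, 3.16}, 2] (* population variance: divides by n *)
CentralMoment[{2.73}, 2]            (* a non-replicated value still gives 0 *)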

@whuber Apologies for dragging this on, but I’d like to have this clear for myself: If there are sufficient samples (>1), wouldn’t you then not need to use the sample variance? To be sure, I’m not talking about the fit part of this question, but about the size of the error bars. Or is the problem that the square root of the unbiased variance is biased?
– Sjoerd C. de Vries
Apr 30 ’13 at 22:22