I have data with quite a lot of points, available on this link. Let’s call the list data. That’s what it looks like:

ListPlot[data, PlotRange -> {{0, 12}, Full}]

As you can see, there are several vertical alignments, i.e. group of points sharing almost the same xxx-coordinate. My goal is to identify such groups.

My idea was to slice the xx-axis with narrow windows and count the points inside each such window; alignments correspond to a large number of points (here, assumed larger than 60). That’s what it looks like:

data2 = Select[

Table[Select[data, Abs[#[[1]] – i] < .025 &], {i, 0, 12, .025}],
Length[#] > 60 &]

This is not too bad, but still has some problems: for example the line around x≈7.5x\approx 7.5 is composed of two groups.

How would you achieve such a task?

=================

How about kernel density estimation?

– mikado

Aug 9 at 20:54

@mikado I think that’s what C.E. used in his answer. Good idea.

– anderstood

Aug 10 at 15:55

=================

1 Answer

1

=================

A slight modification of your approach is to find the centers of the lines and find all points in the neighborhood of those centers. This will avoid the problem that the arbitrary boundaries split a single line into two because the boundary happens to be right on the line, and is not tolerant of even a small deviation. It creates the problem that if two lines are very close together, points may be counted as belonging to both lines, but this shouldn’t be a problem with the right parameters.

{x, y} = Transpose[data];

dist = SmoothKernelDistribution[x, 0.05];

peaks = Select[{#/100., #2} & @@@ FindPeaks[Table[PDF[dist, x], {x, 0, 15, 0.01}]], Last[#] > 0.1 &];

Plot[PDF[dist, x], {x, 0, 15}, Epilog -> {Red, PointSize[Large], Point[peaks]}, PlotRange -> Full]

In the previous piece of code I first created a probability density function for the x coordinate of the points. Peaks in this distribution correspond to x values where there are a lot of points, i.e. a line. I evaluate the PDF at many different values and use FindPeaks to get the local maxima because I don’t know of any direct way to find the peaks of the distribution. I then introduce a cutoff, so that only peaks larger than a certain value counts as lines. The rest of the peaks are presumed to be noise.

ListPlot[data, Epilog -> {Red, PointSize[Large], Point[{#, 80} & @@@ peaks]}]

The red dots mark what is considered to be the centers of lines. Maybe it is a bit too generous, because some of those lines are not very straight. Maybe 0.1 should be 0.11 or 0.15 instead. Nonetheless we have a line detector.

findLine[{peak_, _}] := Select[data, Abs[First[#] – peak] < 0.05 &] ListPlot[findLine /@ peaks] This last image shows the result. At least we don't have any broken lines.