18

Let's assume we have a large list of points, List<Point> pointList (already stored in memory), where each Point contains X, Y, and Z coordinates.

Now, I would like to select, for example, the N% of points with the biggest Z-values out of all points stored in pointList. Right now I'm doing it like this:

double N = 0.05; // selecting only 5% of points
double cutoffValue = pointList
    .OrderBy(p => p.Z) // First bottleneck - creates a sorted copy of all data
    .ElementAt((int)(pointList.Count * (1 - N))).Z;

List<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue).ToList();

But I have two memory usage bottlenecks here: the first during OrderBy (more important) and the second while selecting the points (less important, because we usually want to select only a small number of points).

Is there any way of replacing OrderBy (or maybe another way of finding the cutoff value) with something that uses less memory?

The problem is quite important, because LINQ copies the whole dataset, and for the big files I'm processing it sometimes hits a few hundred MB.

5
  • 1
    Seems to me that the bigger problem here is that you're trying to manipulate so many millions of data points in memory. Why aren't you using some sort of database? Commented Jul 25, 2010 at 16:39
  • @Aaronaught - I agree most databases will take care of this without a second thought. (Excluding Informix, which will take many thoughts with a large possibility of corrupting your data.) Commented Jul 25, 2010 at 16:42
  • @Aaronaught, unfortunately, I'm creating only a part of a larger project and I'm getting the list already stored in memory, so it is crucial to avoid copying it. Commented Jul 25, 2010 at 16:50
  • 2
    I just had to say your question has brought up some very interesting answers. Commented Jul 25, 2010 at 17:03
  • @ChaosPandion: you're right. Writing this question I didn't expect it would gain such attention and produce so many interesting solutions and suggestions. Commented Jul 25, 2010 at 18:59

11 Answers

5

Write a method that iterates through the list once and maintains a set of the M largest elements. Each step requires only O(log M) work to maintain the set, so the whole thing uses O(M) memory and O(N log M) running time.

public static IEnumerable<TSource> TakeLargest<TSource, TKey>
    (this IEnumerable<TSource> items, Func<TSource, TKey> selector, int count)
{
    var set = new SortedDictionary<TKey, List<TSource>>();
    var resultCount = 0;
    var first = default(KeyValuePair<TKey, List<TSource>>);
    foreach (var item in items)
    {
        // Skip this item if the set is already full and its key is
        // smaller than the smallest key currently in the set
        var key = selector(item);
        if (first.Value == null ||
            resultCount < count ||
            Comparer<TKey>.Default.Compare(key, first.Key) >= 0)
        {
            // Add next item to set
            if (!set.ContainsKey(key))
            {
                set[key] = new List<TSource>();
            }
            set[key].Add(item);
            // Keep "first" pointing at the smallest key in the set
            if (first.Value == null ||
                Comparer<TKey>.Default.Compare(key, first.Key) < 0)
            {
                first = set.First();
            }

            // Remove smallest item from set
            resultCount++;
            if (resultCount - first.Value.Count >= count)
            {
                set.Remove(first.Key);
                resultCount -= first.Value.Count;
                first = set.First();
            }
        }
    }
    return set.Values.SelectMany(values => values);
}

That will include more than count elements if there are ties, as your implementation does now.
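
For reference, selecting the points from the question with this extension method would look something like the snippet below (a sketch; the 5% figure and the truncating cast are carried over from the question):

double N = 0.05; // selecting only 5% of points
List<Point> selectedPoints = pointList
    .TakeLargest(p => p.Z, (int)(pointList.Count * N))
    .ToList();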


10 Comments

I just tried a similar approach. It works, and certainly uses much less memory, but it's also MUCH slower... takes about 10 seconds for a list of 100000 points (about 100ms with the OP's current approach). So it's a compromise between memory usage and speed...
@Thomas - this can be written to beat the OP's approach if tweaked to track the min value in the working set and use a sorted array/heap (insertion-sort time) rather than a sorted dictionary.
@Thomas: Yes, the SortedDictionary is a pretty slow implementation of a priority queue. I think you can get much better performance out of this with a proper binary heap, but there isn't one built into the framework and I didn't want to write one here.
Funny thing, this solution seems to use 3 times more memory than the original solution. Measured with CLR Profiler.
@Gacek: That's probably because of all the List<TSource> objects being created. They should be collected in gen 0, but you'll still see allocations. Let me update my answer to skip that step if the list would immediately be removed.
3

You could sort the list in place, using List<T>.Sort, which uses the Quicksort algorithm. But of course, your original list would be sorted, which is perhaps not what you want...

pointList.Sort((a, b) => b.Z.CompareTo(a.Z));
var selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();

If you don't mind the original list being sorted, this is probably the best balance between memory usage and speed.

1 Comment

Thanks, Thomas. I do not mind if the original list gets sorted. But one remark: you should swap a and b in your sorting routine to sort in descending order.
1

You can use Indexed LINQ to put an index on the data you are processing. This can result in noticeable improvements in some cases.

Comments

1

If you combine the two steps, there is a chance a little less work will be done:

List<Point> selectedPoints = pointList
    .OrderByDescending(p => p.Z) // still creates a sorted copy of all data
    .Take((int)(pointList.Count * N))
    .ToList();

But basically this kind of ranking requires sorting, which is your biggest cost.

A few more ideas:

  • if you use a class Point (instead of a struct Point) there will be much less copying.
  • you could write a custom sort that only bothers to move the top 5% up. Something like (don't laugh) BubbleSort; a sketch follows below.
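
A rough sketch of that bubble-sort idea (mine, not the answer's; it assumes the usual Point with a double Z and that pointList may be reordered in place): run only as many bubble passes as there are points you want, so each pass floats the next-largest Z to the front.

int want = (int)(pointList.Count * N); // N = 0.05 as in the question
for (int pass = 0; pass < want; pass++)
{
    // Bubble the largest remaining Z value up to index `pass`.
    for (int i = pointList.Count - 1; i > pass; i--)
    {
        if (pointList[i].Z > pointList[i - 1].Z)
        {
            var tmp = pointList[i];
            pointList[i] = pointList[i - 1];
            pointList[i - 1] = tmp;
        }
    }
}
// After `want` passes the first `want` positions hold the largest Z values.
List<Point> selectedPoints = pointList.GetRange(0, want);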

1 Comment

Thanks for your helpful suggestions. And the idea with bubble sort is neat :) For the small portion of data needed it could be quicker. I have a second, similar but a little more complicated problem, and maybe I will use it there.
1

If your list is in memory already, I would sort it in place instead of making a copy - unless you need it un-sorted again, that is, in which case you'll have to weigh having two copies in memory against loading it again from storage:

pointList.Sort((x,y) => y.Z.CompareTo(x.Z)); //this should sort it in desc. order

Also, not sure how much it will help, but it looks like you're going through your list twice - once to find the cutoff value, and once again to select them. I assume you're doing that because you want to let all ties through, even if it means selecting more than 5% of the points. However, since they're already sorted, you can use that to your advantage and stop when you're finished.

// The list is now sorted in descending order, so the cutoff sits N of the way in
double cutoffValue = pointList[(int)(pointList.Count * N)].Z;
List<Point> selectedPoints = pointList.TakeWhile(p => p.Z >= cutoffValue)
                                      .ToList();

Comments

1

Unless your list is extremely large, it's much more likely that CPU time is your performance bottleneck. Yes, your OrderBy() might use a lot of memory, but it's generally memory that would otherwise mostly sit idle. CPU time really is the bigger concern.

To improve cpu time, the most obvious thing here is to not use a list. Use an IEnumerable instead. You do this by simply not calling .ToList() at the end of your where query. This will allow the framework to combine everything into one iteration of the list that runs only as needed. It will also improve your memory use because it avoids loading the entire query into memory at once, and instead defers it to only load one item at a time as needed. Also, use .Take() rather than .ElementAt(). It's a lot more efficient.

double N = 0.05; // selecting only 5% of points
int count = (int)(N * pointList.Count);
var selectedPoints = pointList.OrderByDescending(p => p.Z).Take(count);

That out of the way, there are three cases where memory use might actually be a problem:

  1. Your collection really is so large as to fill up memory. For a simple Point structure on a modern system we're talking millions of items. This is really unlikely. On the off chance you have a system this large, your solution is to use a relational database, which can keep these items on disk relatively efficiently.
  2. You have a moderate size collection, but there are external performance constraints, such as needing to share system resources with many other processes as you might find in an asp.net web site. In this case, the answer is either to 1) again put the points in a relational database or 2) offload the work to the client machines.
  3. Your collection is just large enough to end up on the Large Object Heap, and the internal buffer allocated by the OrderBy() call is also placed on the LOH. What happens then is that the garbage collector will not compact that memory after your OrderBy() call, and over time you accumulate a lot of memory that is not used but still reserved by your program. In this case the solution is, unfortunately, to break your collection up into multiple groups that are each individually small enough not to trigger use of the LOH.

Update:

Reading through your question again, I see you're reading very large files. In that case, the best performance can be obtained by writing your own code to parse the files. If the count of items is stored near the top of the file you can do much better, and even if you can only estimate the number of records from the size of the file (guess a little high to be sure, and then truncate any extras after finishing), you can build your final collection as you read. This will greatly improve CPU performance and memory use.

Comments

1

I'd do it by implementing "half" a quicksort.

Consider your original set of points, P, where you are looking for the "top" N items by Z coordinate.

Choose a pivot x in P.

Partition P into L = {y in P | y < x} and U = {y in P | x <= y}.

If N = |U| then you're done.

If N < |U| then recurse with P := U.

Otherwise you need to add some items to U: recurse with N := N - |U|, P := L to add the remaining items.

If you choose your pivot wisely (e.g., median of, say, five random samples) then this will run in O(n log n) time.
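
A minimal sketch of that first suggestion (my code, not the answer's; it assumes Point exposes a double Z and that reordering pointList in place is acceptable): partition around a pivot value and keep narrowing only the side that still contains the cutoff index.

// In-place "half" quicksort (quickselect): afterwards the n largest
// Z values occupy list[0 .. n-1], in no particular order.
static void SelectTop(List<Point> list, int n)
{
    var rnd = new Random();
    int lo = 0, hi = list.Count - 1;
    while (lo < hi)
    {
        // Partition descending around a randomly chosen pivot value.
        double pivot = list[rnd.Next(lo, hi + 1)].Z;
        int i = lo, j = hi;
        while (i <= j)
        {
            while (list[i].Z > pivot) i++;
            while (list[j].Z < pivot) j--;
            if (i <= j)
            {
                var tmp = list[i]; list[i] = list[j]; list[j] = tmp;
                i++; j--;
            }
        }
        // Narrow to whichever side still contains index n - 1.
        if (n - 1 <= j) hi = j;
        else if (n - 1 >= i) lo = i;
        else break; // the cutoff index lies between j and i: done
    }
}

// Usage, mirroring the question:
// SelectTop(pointList, (int)(pointList.Count * N));
// List<Point> selectedPoints = pointList.Take((int)(pointList.Count * N)).ToList();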

Hmmmm, thinking some more, you may be able to avoid creating new sets altogether, since essentially you're just looking for an O(n log n) way of finding the Nth greatest item from the original set. Yes, I think this would work, so here's suggestion number 2:

Make a traversal of P, finding the least and greatest items, A and Z, respectively.

Let M be the mean of A and Z (remember, we're only considering Z coordinates here).

Count how many items there are in the range [M, Z], call this Q.

If Q < N then the Nth greatest item in P is somewhere in [A, M). Try M := (A + M)/2.

If N < Q then the Nth greatest item in P is somewhere in [M, Z]. Try M := (M + Z)/2.

Repeat until we find an M such that Q = N.

Now traverse P, pulling out all items greater than or equal to M; those are your result.

That's definitely O(n log n) and creates no extra data structures (except for the result). Howzat?
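
And a sketch of that second suggestion (again mine; it keeps an explicit lo/hi bracket while bisecting and stops once the bracket has effectively collapsed, since an exact Q = N may be unreachable when there are ties at the cutoff):

// Bisect on the Z value itself: lo always admits at least n points (Z >= lo),
// and the bracket is narrowed until lo is (approximately) the cutoff value.
static List<Point> TakeTopByValueBisection(List<Point> points, int n)
{
    double lo = double.MaxValue, hi = double.MinValue;
    foreach (var p in points) // first traversal: find A and Z
    {
        if (p.Z < lo) lo = p.Z;
        if (p.Z > hi) hi = p.Z;
    }

    for (int iter = 0; iter < 64 && lo < hi; iter++)
    {
        double m = lo + (hi - lo) / 2;
        int q = points.Count(p => p.Z >= m); // Q for this candidate M
        if (q == n) { lo = m; break; }
        if (q > n) lo = m; // too many items above M: raise the cutoff
        else hi = m;       // too few items above M: lower the cutoff
    }

    // Final traversal; ties at the cutoff can yield slightly more than n points.
    return points.Where(p => p.Z >= lo).ToList();
}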

1 Comment

Just tried this and it works a treat. On my PC, using generic code on an unsorted array of 65,000 doubles takes about 20ms; an unsorted array of 1,000,000 doubles takes about 500ms. However, you can bypass virtually all of that cost if you can keep your data sorted in the first place.
0

You might use something like this:

pointList.Sort(); // Use your own comparer here if needed

// Skip OrderBy because the list is already sorted (and not copied)
double cutoffValue = pointList.ElementAt((int)(pointList.Count * (1 - N))).Z;

// Skip ToList to avoid another copy of the list
IEnumerable<Point> selectedPoints = pointList.Where(p => p.Z >= cutoffValue);

Comments

0

If you want a small percentage of points ordered by some criterion, you'll be better served by a priority queue data structure: create a size-limited queue (with the size set to however many elements you want), then just scan through the list, inserting every element. After the scan, you can pull out your results in sorted order.
This has the benefit of being O(n log p) instead of O(n log n), where p is the number of points you want, and the extra storage cost also depends on your output size instead of the whole list.

Comments

0
int resultSize = (int)(pointList.Count * N);
FixedSizedPriorityQueue<Point> q =
  new FixedSizedPriorityQueue<Point>(resultSize, p => p.Z);
q.AddEach(pointList);
List<Point> selectedPoints = q.ToList();

Now all you have to do is implement a FixedSizedPriorityQueue that adds elements one at a time and discards the smallest element when it is full.
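
FixedSizedPriorityQueue is not a framework type, so here is one possible sketch (an assumption on my part, leaning on the PriorityQueue<TElement, TPriority> class available in .NET 6 and later; on older frameworks you would hand-roll a binary heap or use a SortedSet instead):

// Keeps the `capacity` items with the largest keys seen so far by evicting
// the current minimum whenever a larger item arrives.
public class FixedSizedPriorityQueue<T>
{
    private readonly int _capacity;
    private readonly Func<T, double> _keySelector;
    private readonly PriorityQueue<T, double> _heap = new PriorityQueue<T, double>(); // min-heap on key

    public FixedSizedPriorityQueue(int capacity, Func<T, double> keySelector)
    {
        _capacity = capacity;
        _keySelector = keySelector;
    }

    public void Add(T item)
    {
        double key = _keySelector(item);
        if (_heap.Count < _capacity)
            _heap.Enqueue(item, key);
        else
            _heap.EnqueueDequeue(item, key); // drops the smaller of (item, current minimum)
    }

    public void AddEach(IEnumerable<T> items)
    {
        foreach (var item in items) Add(item);
    }

    public List<T> ToList()
    {
        var result = new List<T>(_heap.Count);
        foreach (var (element, _) in _heap.UnorderedItems) result.Add(element);
        return result;
    }
}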

Comments

-1

You wrote that you are working with a DataSet. If so, you can use a DataView to sort your data once and use it for all subsequent access to the rows.

I just tried it with 50,000 rows, accessing 30% of them 100 times. My performance results are:

  1. Sort With Linq: 5.3 seconds
  2. Use DataViews: 0.01 seconds

Give it a try.

   [TestClass]
   public class UnitTest1 {
      class MyTable : TypedTableBase<MyRow> {
         public MyTable() {
            Columns.Add("Col1", typeof(int));
            Columns.Add("Col2", typeof(int));
         }

         protected override DataRow NewRowFromBuilder(DataRowBuilder builder) {
            return new MyRow(builder);
         }
      }

      class MyRow : DataRow {
         public MyRow(DataRowBuilder builder) : base(builder) {
         }

         public int Col1 { get { return (int)this["Col1"]; } }
         public int Col2 { get { return (int)this["Col2"]; } }
      }

      DataView _viewCol1Asc;
      DataView _viewCol2Desc;
      MyTable _table;
      int _countToTake;

      [TestMethod]
      public void MyTestMethod() {
         _table = new MyTable();


         int count = 50000;
         for (int i = 0; i < count; i++) {
            _table.Rows.Add(i, i);
         }

         _countToTake = _table.Rows.Count / 30;
         Console.WriteLine("SortWithLinq");
         RunTest(SortWithLinq);
         Console.WriteLine("Use DataViews");
         RunTest(UseSortedDataViews);
      }

      private void RunTest(Action method) {
         int iterations = 100;
         Stopwatch watch = Stopwatch.StartNew();
         for (int i = 0; i < iterations; i++) {
            method();
         }
         watch.Stop();
         Console.WriteLine("   {0}", watch.Elapsed);
      }

      private void UseSortedDataViews() {
         if (_viewCol1Asc == null) {
            _viewCol1Asc = new DataView(_table, null, "Col1 ASC", DataViewRowState.Unchanged);
            _viewCol2Desc = new DataView(_table, null, "Col2 DESC", DataViewRowState.Unchanged);
         }

         var rows = _viewCol1Asc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
         IterateRows(rows);
         rows = _viewCol2Desc.Cast<DataRowView>().Take(_countToTake).Select(vr => (MyRow)vr.Row);
         IterateRows(rows);
      }

      private void SortWithLinq() {
         var rows = _table.OrderBy(row => row.Col1).Take(_countToTake);
         IterateRows(rows);
         rows = _table.OrderByDescending(row => row.Col2).Take(_countToTake);
         IterateRows(rows);
      }

      private void IterateRows(IEnumerable<MyRow> rows) {
         foreach (var row in rows)
            if (row == null)
               throw new Exception("????");
      }
   }

1 Comment

I just tried this (because the results looked strange): the DataViews do not return any rows (an empty enumerable), so yes, it's fast but also useless.
