
grouping algorithm

Options
  • 15-10-2011 12:21pm
    #1
    Registered Users Posts: 4,844 ✭✭✭


    Hi Guys
I'm looking to write a function that accepts a sorted list of numbers and groups them into merged groups under a certain limit, in the least amount of actions. Does anyone know if something like this already exists?
    Thanks!!

Here's an example:

    10 is the max group size

    list to be grouped:
    1,
    2,
    3,
    4,
    5,
    6,
    7,

Resulting groups:
7+3
6+4
5+2+1
etc.


Comments

  • Registered Users Posts: 1,216 ✭✭✭carveone


It looks straightforward enough to produce a function that works. Whether it's optimal is another story.

    I mean, you can do it by eye. Start at the biggest number and add the next biggest that doesn't go over your limit. Continue until you go over. Repeat.
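That by-eye method translates almost line-for-line into code. Here is a hypothetical Python sketch (the thread's code is VB.NET; `greedy_groups` is an illustrative name, not anything from the thread):

```python
def greedy_groups(nums, limit):
    """Repeatedly seed a group with the biggest remaining number,
    then add the next biggest numbers that still fit under the limit."""
    remaining = sorted(nums, reverse=True)
    groups = []
    while remaining:
        group = [remaining.pop(0)]      # biggest remaining seeds the group
        for x in remaining[:]:          # scan a snapshot, biggest first
            if sum(group) + x <= limit:
                group.append(x)
                remaining.remove(x)     # consume it from the live list
        groups.append(group)
    return groups
```

On the opening example this produces (7,3), (6,4), (5,2,1), matching the groups the OP listed.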

    Maybe stackexchange would be a better place to ask?

    If I google I hit stuff like cluster analysis and then it gets scary and I run away :)


  • Registered Users Posts: 4,844 ✭✭✭shootermacg


    Thanks for the reply!

    I was fooling around with this for a while and I came up with this:
The end result is a string of merged groups under the limit constraint. It's not optimal, but it does the job, although in yucky vb.net.
    The whole idea for this is I need to merge files under an attachment size limit.

    If anyone has any improvements that could be made, please let me know ^ ^.


    I put 1,2,3,4,5,6,7,8,9,2,3,3,4,5,6,1,1,1,1,2,3,4 in
    and I get back (1,9)(1,8,1)(1,7,2)(1,6,3)(2,6,2)(3,5)(3,5)(3,4)(4,4)
    Private Function merge(ByVal li As List(Of Int32), ByVal limit As Int32) As String
            Dim grouping As String = ""
            While li.Count > 0
                'seed the group with the first (smallest) remaining number
                Dim first As Int32 = li.Item(0)
                li.RemoveAt(0)
                grouping &= "(" & first
                Dim i As Int32
                'count down from the biggest value, adding anything that still fits
                For i = li.Count - 1 To 0 Step -1
                    If li(i) + first <= limit Then
                        grouping &= "," & li(i)
                        first += li(i)
                        li.RemoveAt(i)
                        If first = limit Then
                            Exit For
                        End If
                    End If
                Next
                grouping &= ")"
            End While
            Return grouping
        End Function
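For readers outside .NET, here is a rough Python port of the function above (assuming, as the function's contract requires, that the input list is sorted ascending; the output quoted in the post corresponds to the sorted input):

```python
def merge(nums, limit):
    """Port of the VB.NET merge(): seed each group with the smallest
    remaining number, then count down from the biggest end, adding
    anything that still fits under the limit."""
    remaining = list(nums)            # expected sorted ascending
    parts = []
    while remaining:
        first = remaining.pop(0)      # smallest remaining seeds the group
        group = [first]
        i = len(remaining) - 1
        while i >= 0:                 # countdown from the biggest value
            if remaining[i] + first <= limit:
                group.append(remaining[i])
                first += remaining[i]
                del remaining[i]
                if first == limit:    # group is exactly full
                    break
            i -= 1
        parts.append("(" + ",".join(map(str, group)) + ")")
    return "".join(parts)
```

Sorting the quoted input ascending and running this reproduces the quoted output string exactly.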


  • Registered Users Posts: 1,216 ✭✭✭carveone


    I think you need code tags - it's a bit hard to follow otherwise. But yeah, like that. I would have the numbers sorted from highest to lowest. I think that would make it a lot better.

    And then loop:
    pop the first number (which will be the highest) off into the group; search for the next item that, when added to the group, does not cause overflow; pop that; repeat until you reach the end of the list.
    And then repeat the whole thing again until you're out of numbers.

    Which is almost what you have except you are starting at one end and then searching from the other which looks a little odd but evidently kinda works!

    I'll take easy to implement over optimal :D


  • Registered Users Posts: 4,844 ✭✭✭shootermacg


    Yeah, burning the collection at both ends ^ ^.
    When removing items in a for loop I always take them off the end, otherwise the loop is affected in bad ways.


  • Registered Users Posts: 4,844 ✭✭✭shootermacg


    Latest update:
    This makes it usable for my purposes.
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            'setting up the collections; no need of any files over the limit
            Dim PrintJobs As New List(Of PrintJob)
            Dim PrintJobsForSplit As New List(Of PrintJob)
            Dim PrintJobsForMerge As New List(Of PrintJob)
            Dim j As Int32 = 0
            For j = 1 To 11
                PrintJobs.Add(New PrintJob(j + 1, j + 1))
            Next

            For j = 0 To PrintJobs.Count - 1
                If PrintJobs(j).filesize > 10 Then
                    PrintJobsForSplit.Add(PrintJobs(j))
                Else
                    PrintJobsForMerge.Add(PrintJobs(j))
                End If
            Next

            'sort the list by fileSize
            Dim q As List(Of PrintJob) = (From pj In PrintJobsForMerge Order By pj.filesize Select pj).ToList()
            Dim groups As List(Of List(Of PrintJob)) = merge(q, 10)
        End Sub

    'returns a list of lists
        Private Function merge(ByVal li As List(Of PrintJob), ByVal limit As Int32) As List(Of List(Of PrintJob))
            Dim groups As New List(Of List(Of PrintJob))
            While li.Count > 0
                'seed the group with the smallest remaining file
                Dim grouping As New List(Of PrintJob)
                grouping.Add(li.Item(0))
                li.RemoveAt(0)

                Dim i As Int32
                Dim fileSize = grouping(0).filesize
                'count down from the biggest file, adding any that still fit
                For i = li.Count - 1 To 0 Step -1
                    If li(i).filesize + fileSize <= limit Then
                        grouping.Add(li(i))
                        fileSize += li(i).filesize
                        li.RemoveAt(i)
                        If fileSize = limit Then
                            Exit For
                        End If
                    End If
                Next
                groups.Add(grouping)
            End While
            Return groups
        End Function


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    in the least amount of actions.

    Do you mean 'actions' in the sense of the least number of groupings to make? Or do you mean, in the most computationally efficient way?
    Or some balance?
    I.e. what are you trying to optimise for, here?

    It sounds like you've solved your problem, in practice, but if you wanted to read more, I'd start with reading about the knapsack problem; it sounds like you are dealing with a variant of it: http://en.wikipedia.org/wiki/Knapsack_problem

    carveone wrote: »
    If I google I hit stuff like cluster analysis and then it gets scary and I run away :)
    fwiw, I don't think clustering would be relevant here; if this is difficult, it sounds more like it's in the domain of combinatorial optimisation. http://en.wikipedia.org/wiki/Combinatorial_optimization


  • Registered Users Posts: 4,844 ✭✭✭shootermacg


    fergalr wrote: »
    Do you mean 'actions' in the sense of the least number of groupings to make? Or do you mean, in the most computationally efficient way?
    Or some balance?
    I.e. what are you trying to optimise for, here?

    It sounds like you've solved your problem, in practice, but if you wanted to read more, I'd start with reading about the knapsack problem; it sounds like you are dealing with a variant of it: http://en.wikipedia.org/wiki/Knapsack_problem

    By 'actions' I would have been referring mainly to the fact that the first file in each group would be the biggest, so the files I would be appending using a writer would be the smaller files in the group. So it was nice that the groups worked out in order, meaning I don't have to re-sort each list or append big files onto small files.

    Hehe, I always end up solving my problems, but I thought this problem had most likely been done to death already, and like most programmers I value CPU cycles, especially my own :D.

    Thanks for the knapsack tip, I think it was something to do with buckets in my day. I'll take a look at it and see if I can improve.


  • Registered Users Posts: 4,844 ✭✭✭shootermacg


    OK, last update from me. I was pretty happy with the original method until I noticed the groups weren't optimal, so I had to stick two list-sorts in, to reverse the list and back again. It's not a huge performance hit compared with the time it will take to append the file groups together, so well worth it.

    I could have used 2 lists, but I prefer sorting because it's not a huge hit at all, assuming lists just use pointers.
    Now the method uses the biggest value available to fill groups.
    So groups come out like so if the limit is set to 10:

    ( 10 ) ( 10 ) ( 9 1 ) ( 9 1 ) ( 8 2 ) ( 8 2 ) ( 7 3 ) ( 7 3 ) ( 6 4 ) ( 6 4 ) ( 6 4 ) ( 6 4 ) ( 6 4 ) ( 6 4 ) ( 6 4 ) ( 6 4 ) ( 5 5 ) ( 5 5 ) ( 5 5 ) ( 5 5 ) ( 2 2 2 2 2 ) ( 2 1 1 1 1 1 )

    Much nicer than the old way; as you can see, a lot more groups hit the limit, leaving the smaller numbers until last where possible. For comparison, the old output:

    ( 10 ) ( 10 ) ( 9 1 ) ( 9 1 ) ( 8 1 1 ) ( 8 1 1 ) ( 7 1 2 ) ( 7 2 ) ( 6 2 2 ) ( 6 2 2 ) ( 6 2 2 ) ( 6 3 ) ( 6 3 ) ( 6 4 ) ( 6 4 ) ( 6 4 ) ( 5 4 ) ( 5 4 ) ( 5 4 ) ( 5 4 ) ( 5 4 ) ( 5 5 ) ( 5 )

    Thanks guys for the tips!
    Here's the final version in case someone else needs it.:)
    Private Function GroupFiles(ByVal li As List(Of PrintJob), ByVal limit As Int32) As List(Of List(Of PrintJob))
            Dim groups As New List(Of List(Of PrintJob))
            While li.Count > 0
                'sort descending so the biggest remaining file seeds the group
                li.Sort(Function(x, y) y.filesize.CompareTo(x.filesize))
                Dim grouping As New List(Of PrintJob)
                grouping.Add(li.Item(0))
                li.RemoveAt(0)

                Dim i As Int32
                Dim fileSize = grouping(0).filesize
                'reverse the order to allow a countdown from the biggest value
                li.Sort(Function(x, y) x.filesize.CompareTo(y.filesize))
                For i = li.Count - 1 To 0 Step -1
                    If li(i).filesize + fileSize <= limit Then
                        grouping.Add(li(i))
                        fileSize += li(i).filesize
                        li.RemoveAt(i)
                        If fileSize = limit Then
                            Exit For
                        End If
                    End If
                Next
                groups.Add(grouping)
            End While
            Return groups
        End Function


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    shootermacg wrote: »
    OK Last update from me, I was pretty happy with the original method until I noticed the groups weren't optimal ... Here's the final version in case someone else needs it.:)


    Disclaimers:
    1) I'm in a hurry here, so may make mistakes
    2) I don't know that language


    A) Algorithm:
    I can see the algorithm you are implementing.

    If I give you the numbers: (4,4,3,3,3,3), you are going to produce: (4,4),(3,3,3),(3)
    when you could produce: (4,3,3),(4,3,3)
    In other words, you make 3 groups, when you could make 2.
    That's just an example of the general issue with the algorithm - it will find sub-optimal solutions.

    Are you cool with that?
    If you are, that's fine - but if your requirements are more strictly to find the least number of groups, you will need a more expensive algorithm.

    The 'greedy' search you do there might work fine in practice. To investigate, you'd take some real-world data and measure how often the simple greedy algorithm you wrote gets a good result vs. a more complex algorithm that always gets the best result; you'd see how many extra groups happened, and decide, based on your application, whether you care.

    Maybe what you have is fine, for what you want to do.
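To make the trade-off concrete, a tiny brute-force search (exponential, so toy inputs only; `min_groups` is an illustrative name, not code from the thread) verifies that two groups really are enough for the (4,4,3,3,3,3) example:

```python
def min_groups(items, limit):
    """Branch-and-bound over all ways of packing items into groups
    whose sums stay <= limit; returns the fewest groups possible."""
    best = [len(items)]                  # worst case: one group per item

    def search(rest, bins):
        if len(bins) >= best[0]:         # can't beat the best found so far
            return
        if not rest:
            best[0] = len(bins)
            return
        x, tail = rest[0], rest[1:]
        for b in bins:                   # try placing x in each open group
            if sum(b) + x <= limit:
                b.append(x)
                search(tail, bins)
                b.pop()
        bins.append([x])                 # or open a new group for x
        search(tail, bins)
        bins.pop()

    search(sorted(items, reverse=True), [])
    return best[0]
```

Here `min_groups([4, 4, 3, 3, 3, 3], 10)` returns 2, where the greedy method produces 3 groups.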

    B) Implementation:
    I think the implementation is a little scrappy.
    Why do you do so much sorting?
    You aren't adding anything to the list, right?
    If all you are doing is deleting elements from the list, the relative order of the remaining items in the list isn't going to change, right? So why re-sort?
    If you are using something like insertion sort (I don't know the underlying sorting algorithm) you won't see much of a penalty. But if the library algorithm is something like quicksort (quite possible), then you are going to get stung for n*log(n) every time you do the sort - quicksort isn't fast when run on already-sorted data.

    So, let's say n is the number of elements in 'li'.

    Your outer loop runs n times, and the list gets smaller as it goes, so it is n/2 long on average.
    Inside, you do 2 potentially n*log(n) sorts; so you are at about n^2 * log(n) - basically, you've got an n-squared implementation.

    As the size of 'li' gets big, the implementation is going to slow down a lot, basically.

    Maybe you'll never have big n - but it's worth not coding an n-squared algorithm if it's easy not to, because everyone thinks "I'll never have many items", and then gets stung later when they - or someone else - use the code somewhere else.
    So, if you just do the sorting once, outside the 'While li.Count > 0' loop, and then just delete elements from the list as you consume them (or set them to some dummy sentinel value - not sure what that language allows), you can make a much faster implementation. (I.e. inside, when you want the smallest element of the list, don't re-sort the list in the opposite order - if the list is sorted so that the largest elements are at the front, then the smallest ones are at the back.)
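A sketch of that suggestion (illustrative Python, not the thread's code): sort once, descending, and only delete as items are consumed, so the sorted order never needs re-establishing:

```python
def group_files(sizes, limit):
    """Greedy grouping with a single up-front sort: the largest remaining
    size seeds each group, then the list is scanned largest-first for
    anything that still fits. Deletions preserve the sorted order, so
    no re-sorting is needed inside the loop."""
    remaining = sorted(sizes, reverse=True)   # one O(n log n) sort, up front
    groups = []
    while remaining:
        total = remaining.pop(0)              # largest remaining seeds the group
        group = [total]
        i = 0
        while i < len(remaining) and total < limit:
            if total + remaining[i] <= limit:
                total += remaining[i]
                group.append(remaining.pop(i))  # consume; order preserved
            else:
                i += 1
        groups.append(group)
    return groups
```

Note this is still the same greedy strategy, so it keeps the sub-optimality of the (4,4,3,3,3,3) example: it yields three groups there, not the optimal two.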


  • Registered Users Posts: 4,844 ✭✭✭shootermacg


    fergalr wrote: »
    Disclaimers: 1) I'm in a hurry here, so may make mistakes 2) I don't know that language ... A) Algorithm: ... it will find sub-optimal solutions. Are you cool with that? ... B) Implementation: ... Why do you do so much sorting? ...

    Thanks for the input.
    All very good points indeed!
    I've run this 10,000 times and the sorts don't seem to impact things too much, but I take the point, and it just looks like a simple case of originally reversing the outer loop, so consider them history :)

    fergalr wrote: »
    If I give you the numbers: (4,4,3,3,3,3), you are going to produce: (4,4),(3,3,3),(3)
    when you could produce: (4,3,3),(4,3,3)


    Especially liked the conclusion:
    This is a greedy effort, and it would be far better to have fewer groups than groups exactly on the limit, but the algorithm for creating groups like the ones in your example just isn't coming to me. I'll have a rethink over the next few days and hopefully come up with an alternative. The good thing about code is I can always replace the function if I do manage to figure out a solution.

    I doubt anyone in work, even the peer reviewer, will be bothered by the method, but I personally want it to be the best it can be.

    The language is vb.net, it's one of our legacy apps I'm afraid, c# is much nicer to look at.

    Thanks again! Very clever! :D

