Skip to content

Conversation

@tijeco
Copy link
Contributor

@tijeco tijeco commented Apr 23, 2021

This is for #92, a port of all_iupac_variants from easy_dna.

IUPAC codes are stored as a map of rune slices. The cartesian products are then generated using a retooled version of the function found in github.com/schwarmco/go-cartesian-product, which uses recursion and go routines, so it should scale quite nicely to large sequences.

sequence.go Outdated
}
}

func allVariantsIUPAC(seq string) []string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make public (capitalize allVariantsIUPAC -> AllVariantsIUPAC ) and create export comment starting as // AllVariantsIUPAC (insert description here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyStiles Does the export comment go on the same line as the function declaration?

sequence.go Outdated

// the following functions Iter and iter, derive from github.com/schwarmco/go-cartesian-product
// which uses interfaces, so I modified it to use runes
func Iter(params ...[]rune) chan []rune {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make private.

@TimothyStiles
Copy link
Collaborator

This is awesome @tijeco! I've added a couple things and will review more later today when I get the chance.

Commenting looks good. Variables could have more descriptive names. Are there any standard test cases we could use to test it?

@tijeco
Copy link
Contributor Author

tijeco commented Apr 23, 2021

@TimothyStiles thanks! From easy_dna the sample was as follows

>>> all_iupac_variants​('ATN')
>>>​ [​'ATA'​, ​'ATC'​, ​'ATG'​, ​'ATT']

The output from the port I wrote won't be sorted, and I think because it uses go routines the order could be different each time.

So a test would either need to sort the output, or the function would need to return sorted output.

I don't know which is preferred.

@TimothyStiles
Copy link
Collaborator

TimothyStiles commented Apr 23, 2021

@tijeco In this case I think we can just sort the output for the test. There may be something we can do about it being returned from the function sorted but we don't have to do that right now.

Do you think we could find a stronger test case? Something like "The quick brown fox jumped over the lazy dog", but for degenerate base pairs? N, B, V, etc.

@tijeco
Copy link
Contributor Author

tijeco commented Apr 23, 2021

@TimothyStiles Sounds good! I will work on a test set and put it in a separate pull request for sequence_test.go

@TimothyStiles
Copy link
Collaborator

@tijeco could you put it in this pull request? Would be easier for me to manage. All you have to do is push commits to the PR in the branch and it should auto update.

@tijeco
Copy link
Contributor Author

tijeco commented Apr 23, 2021

@TimothyStiles Will do! Thanks!

I don't know if I'll have time to work on it today/this weekend, but I definitely will early next week.

@Koeng101
Copy link
Contributor

So here is a potential problem: RAM issues.

Each goroutine takes about 2Kb of memory https://stackoverflow.com/questions/22326765/go-memory-consumption-with-many-goroutines

If you have 10N, for example, this would spawn (potentially) 1,048,576 goroutines, taking up nearly 2GB of memory. Meanwhile, if you did an iterative function without goroutines, you could possibly reduce that amount of RAM by a large amount because a new goroutine wouldn't be spawned for every sequence branch. Since append functions do not take much CPU, you might get equivalent performance with a much simpler stack that takes far less RAM.

Could you try some harder benchmarks to see where the concurrent method breaks down?

@tijeco
Copy link
Contributor Author

tijeco commented Apr 29, 2021

@Koeng101 thanks for pointing that out! I'll be honest, go routines and concurrency is still a new thing to me, so I'm glad to have this opportunity to work on this.

I'll try to make a non-concurrent version of the function as well and do some benchmarks to see how memory/cpu usage varies between the two.

@TimothyStiles
Copy link
Collaborator

@tijeco can you pull down the latest from prime and merge it with your PR branch? Otherwise it'll delete some recent changes to the file. After that I should be able to run CI and do a proper review.

@tijeco
Copy link
Contributor Author

tijeco commented May 15, 2021

@TimothyStiles I think I did the thing?

Comment on lines +108 to +110
allVariants := make([][]rune, possibleVariants) // this is the 2D slice where all variants will be stored
variantHolders := make([]rune, possibleVariants*len(inList)) // this is an empty slice with a length totaling the size of all input characters
variantChoices := make([]int, len(inList)) // these will be all the possible variants
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these allocations necessary? Can you run them without?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These allocations are helping to reduce the total memory footprint. Full disclosure: this is a retool of the cartesian product from Rosetta code https://rosettacode.org/wiki/Cartesian_product_of_two_or_more_lists

@TimothyStiles
Copy link
Collaborator

@tijeco I did some finagling and fixed some bugs and the initial test itself. Think you could add an example function that will be rendered to our docs similar to this?

@TimothyStiles TimothyStiles merged commit 72be193 into bebop:prime May 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants