Seqhash v1 #82
@TimothyStiles I added in a long-form description of the Seqhash algorithm. Please check it over and let me know what you think.
Shoot, wrong branch - fixing (eh, it'll merge)
Koeng101
left a comment
Ready for review 2 @TimothyStiles
| minimal sequence is taken (whether or not the min or max is used doesn't matter, just needs to
| be consistent).
|
| If the sequence is RNA, the sequence will be converted to DNA before hashing. While the full Seqhash
<3
| Seqhash stuff starts here.
|
| There is a big problem with current sequence databases - they all use different
This whole description is excellent.
I just want to recognize how excellent this pull request was. Great commenting. I understand what everything does and why. Names are perfect. SUPER A+ KEONI!
Thank you!
| The Seqhash algorithm makes several opinionated design choices, primarily to make working
| with Seqhashes more consistent and nice. The Seqhash algorithm only uses a single hash function,
| Blake3, and only operates on DNA, RNA, and Protein sequences. These identifiers will be seen
what is a single hash function/Blake3?
"A single hash function" means one hash function rather than several, i.e., singular instead of plural.
BLAKE3 is a fast cryptographic hash function: https://github.com/BLAKE3-team/BLAKE3
This PR adds the Seqhash function and SequenceSeqhash method to Sequence structs.
Briefly, the algorithm uniquely hashes DNA, RNA, and Protein sequences. It checks for properly formed sequences (no X nucleotide or amino acids) and has full test coverage with written examples.
Seqhash's primary use case is navigation between different databases. In particular, Seqhashes can be used to do a full comparison of Genbank <-> Uniprot to get EC numbers for any protein.
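For readers skimming the thread, here is a rough sketch of the normalization steps quoted in the review above: RNA is converted to DNA, malformed input such as an X nucleotide is rejected, and a minimal sequence is chosen consistently before hashing. This is not the code from this PR. The function names (`normalizeNucleotides`, `reverseComplement`, `hashSketch`) are hypothetical, the choice to compare a sequence against its reverse complement is an assumption (the excerpt does not show which candidates are compared), only A, C, G, T/U are accepted for simplicity, and SHA-256 from the Go standard library stands in for BLAKE3, so the output is not a real Seqhash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// normalizeNucleotides upper-cases the input, converts RNA to DNA (U -> T),
// and rejects any character outside A, C, G, T (so an X nucleotide is an error).
// Assumption: ambiguity codes are not handled in this sketch.
func normalizeNucleotides(seq string) (string, error) {
	s := strings.ReplaceAll(strings.ToUpper(seq), "U", "T")
	if s == "" {
		return "", errors.New("malformed sequence: empty input")
	}
	for _, r := range s {
		if !strings.ContainsRune("ACGT", r) {
			return "", errors.New("malformed sequence: unexpected character " + string(r))
		}
	}
	return s, nil
}

// reverseComplement returns the reverse complement of an upper-case DNA string
// containing only A, C, G, T.
func reverseComplement(dna string) string {
	complement := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(dna))
	for i := 0; i < len(dna); i++ {
		out[len(dna)-1-i] = complement[dna[i]]
	}
	return string(out)
}

// hashSketch normalizes the sequence, takes the lexicographically minimal of
// the sequence and its reverse complement (assumed candidate set), and hashes
// the result. SHA-256 is a stand-in here; the PR itself uses BLAKE3.
func hashSketch(seq string) (string, error) {
	dna, err := normalizeNucleotides(seq)
	if err != nil {
		return "", err
	}
	canonical := dna
	if rc := reverseComplement(dna); rc < canonical {
		canonical = rc
	}
	sum := sha256.Sum256([]byte(canonical))
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := hashSketch("AUGC") // RNA input, converted to ATGC
	b, _ := hashSketch("GCAT") // reverse complement of ATGC
	fmt.Println(a == b)        // prints true

	_, err := hashSketch("ATXG") // X is rejected
	fmt.Println(err)
}
```

Taking the minimum rather than the maximum is arbitrary, as the quoted comment notes; what matters is that every caller makes the same choice so that equivalent inputs map to the same identifier.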