@Koeng101
Contributor

This PR adds the Seqhash function and SequenceSeqhash method to Sequence structs.

Briefly, the algorithm uniquely hashes DNA, RNA, and Protein sequences. It checks for properly formed sequences (no X nucleotides or amino acids) and has full test coverage with written examples.

Seqhash's primary use case is navigation between different databases. In particular, Seqhashes can be used to do a full comparison of GenBank <-> UniProt to get EC numbers for any protein.

@Koeng101
Contributor Author

Koeng101 commented Dec 3, 2020

@TimothyStiles I added in a long-form description of the Seqhash algorithm. Please check it over and let me know what you think.

@Koeng101
Contributor Author

Koeng101 commented Dec 3, 2020

Shoot, wrong branch - fixing (eh, it'll merge)

@bebop bebop deleted a comment from codecov bot Apr 14, 2021
Contributor Author

@Koeng101 Koeng101 left a comment

Ready for review 2 @TimothyStiles

minimal sequence is taken (whether or not the min or max is used doesn't matter, just needs to
be consistent).

If the sequence is RNA, the sequence will be converted to DNA before hashing. While the full Seqhash
Collaborator

<3


Seqhash stuff starts here.

There is a big problem with current sequence databases - they all use different
Collaborator

This whole description is excellent.

@TimothyStiles TimothyStiles merged commit acfec4a into prime Apr 16, 2021
@TimothyStiles
Collaborator

I just want to recognize how excellent this pull request was. Great commenting. I understand what everything does and why. Names are perfect. SUPER A+ KEONI!

@TimothyStiles TimothyStiles deleted the seqhash_v1 branch April 16, 2021 20:16
@Koeng101
Contributor Author

Thank you!


The Seqhash algorithm makes several opinionated design choices, primarily to make working
with Seqhashes more consistent and nice. The Seqhash algorithm only uses a single hash function,
Blake3, and only operates on DNA, RNA, and Protein sequences. These identifiers will be seen

what is a single hash function/Blake3?

Contributor Author

"A single hash function" just means one hash function rather than several, i.e., singular instead of plural.

BLAKE3 is a fast cryptographic hash function: https://github.com/BLAKE3-team/BLAKE3
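
For readers following along, here is a rough sketch of the steps the quoted description mentions: convert RNA to DNA, take the minimal sequence, then run it through one hash function. This is not the code from this PR. The names `reverseComplement`, `canonicalDNA`, and `sketchHash` are made up for illustration, the sketch assumes the "minimal sequence" is chosen between a strand and its reverse complement, and it uses SHA-256 from the Go standard library as a stand-in because BLAKE3 needs a third-party package. It also skips the validation the real function performs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// reverseComplement returns the reverse complement of an uppercase DNA string.
// Hypothetical helper; it does no validation of the input alphabet.
func reverseComplement(dna string) string {
	complement := map[byte]byte{'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
	out := make([]byte, len(dna))
	for i := 0; i < len(dna); i++ {
		out[len(dna)-1-i] = complement[dna[i]]
	}
	return string(out)
}

// canonicalDNA uppercases the input, converts RNA to DNA (U -> T), and,
// for double-stranded input, keeps the lexicographically smaller of the
// strand and its reverse complement. Picking min over max is arbitrary;
// it only has to be consistent.
func canonicalDNA(sequence string, doubleStranded bool) string {
	dna := strings.ReplaceAll(strings.ToUpper(sequence), "U", "T")
	if doubleStranded {
		if rc := reverseComplement(dna); rc < dna {
			return rc
		}
	}
	return dna
}

// sketchHash hashes the canonical form with one hash function.
// SHA-256 is a stand-in here; the PR describes Blake3.
func sketchHash(sequence string, doubleStranded bool) string {
	sum := sha256.Sum256([]byte(canonicalDNA(sequence, doubleStranded)))
	return hex.EncodeToString(sum[:])
}

func main() {
	// ATGC and GCAT are reverse complements, so both strands hash the same.
	fmt.Println(sketchHash("ATGC", true) == sketchHash("GCAT", true))
	// RNA is converted to DNA before hashing.
	fmt.Println(sketchHash("AUGC", false) == sketchHash("ATGC", false))
}
```

Using one hash function everywhere means two identifiers can be compared as plain strings without first checking which algorithm produced them.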
