-
Notifications
You must be signed in to change notification settings - Fork 2.4k
feat: gopls allow fuzzy matching with spelling errors #612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Before the fuzzy matching was skipped when not all characters could be matched. Now it's changed in favor of getting the score for the number of characters that match.
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
| m.roles = RuneRoles(candidate, m.rolesBuf[:]) | ||
|
|
||
| return true | ||
| return true, j |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is quite right. j is checking exact consecutive matches from the start of the pattern. If the typo appears near the beginning of the pattern (e.g. "tetsincrementalnope") the scoring won't be right since we will only consider "te" (I think?).
What happens if we just let the entire pattern score instead of trimming to the prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is quite right. j is checking exact consecutive matches from the start of the pattern. If the typo appears near the beginning of the pattern (e.g. "tetsincrementalnope") the scoring won't be right since we will only consider "te" (I think?).
Thanks for the comment, it made me realized match was more nuanced than i've anticipated.
j expresses the number of matching chars going through the pattern a single time, not necessarly consecutive.
So tetsincrementalnope -> j is 3, because t and e are consecutive matches, but the third t is found later in the string, so it is counts as well.
However, this 3 seems important for the score func, because if we read the resulting matrix.

it's evident we have a significant drop in score after that 3.
So i would say that scores relies on the number of matching characters we can found reading through the pattern and the candidate a single time.
So, making a spelling error with a letter that can be later found in the string is better than making a spelling error with a wrong char.
Ex. tetsincrementalnope (score 3) is better than tewtincrementalNope (score 2)
By looking at the score matrix, i don't think we can change this behavior without changing the score func as well.
I don't know if you have suggestion on how this proceed, because this change that i've made is:
- making spelling mistakes less punishing later in the string, because we are basically cutting the
patternto be matched at the first mistake (bar some exceptions discussed above)
But it is not:
- fixing spelling mistakes in general, like if we would use Levenstein
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth to get rid of this scoring mechanism since it's kind of weird with spelling mistakes, but it would be a bigger job, plus changing some people's UX because the new scoring system will be different from the current one.
| // if the candidate is short the characters have to match completely. | ||
| if len(candidate) <= shortPatternSize && j != len(m.patternLower) { | ||
| return false, 0 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we limit sloppy matches to longer patterns? I feel like we should instead filter based on a threshold of sloppy characters (i.e allow 1 or 2 non-matching characters, maybe depending on pattern length).
I'm assuming this early return here is for performance (i.e. want to skip the expensive scoring for candidates that clearly don't match). We don't want an O(n^2) check here, but maybe we can handle a O(3n) where we backtrack a couple times to allow up to 2 non-matching characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this early return was becasue my change is allowing the score func to run more frequently, and this is just an holistic cutoff like "if there candidate to be matched is really short it must have all chars matching`.
However this is entirely up to debate, it was just to throw the idea of having a shortcut for short candidates.
I'm experimented with the idea of "at least 30% of matching characters" or something like that, but i would honestly just prefer to have a simple, clear cut-off and run with it than hitting edge cases with percentages.
As I said, i'm entirely open to change this approach once we solve the other comment's problem, which is more important!
Description
Before the fuzzy matching was skipped when not all characters could be matched. Now it's changed in favor of getting the score for the number of characters that match.
By debugging and trying to understand the current scoring mechanism I think I've understood
m.scoresis a matrix [length-candidate][length-pattern][another-dim].Before it was getting the score always from len(candidate) and len(pattern), but this is wrong in case we know the pattern has
xamount of char actually matching.So:
matchrequires full matching chars for short candidateUnit tests
I've added a few unit tests I thought showed the improvement.
Without my patch the failures in the new test cases are:
Manual QA
By trying to use the examples in the issue are satisfied, and in general it seems to match broader. I'm not entirely sure how to test false positives.
Fix: golang/go#74793