Merged
pkuyym
commented
Sep 19, 2017
deep_speech_2/README.md
Outdated
* Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
* Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.

Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are pruned by '0 1 1 1 1'. To save disk storage we convert the arpa file to 'trie' binary file with parameters '-a 22 -q 8 -b 8'.
Contributor
Author
Our released language model is pruned by '0 1 1 1 1' and the maximum n-gram order is 5.
xinghai-sun
requested changes
Sep 19, 2017
deep_speech_2/README.md
Outdated
The english corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our english languge model. There are some preprocessing steps before training:

* Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.
Contributor
Remove "which".
Why `\s`? Does it stand for whitespace?
deep_speech_2/README.md
Outdated
The english corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our english languge model. There are some preprocessing steps before training:

* Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.
* Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
Contributor
--> beginning and trailing?
What is the plural form of "whitespace"?
Should the two occurrences of "lowercase" be singular or plural?
Contributor
Author
Trailing whitespace has little impact (for example, the newline at the end of a sentence), so there is no need to remove it.
deep_speech_2/README.md
Outdated
* Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.
* Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
* Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.
Contributor
---> Top 400,000 most frequent words are ....and the rest are replaced with ...
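For reference, the three preprocessing steps under review (character filtering with number spelling, whitespace squeezing with lowercasing, and vocabulary truncation) can be sketched in Python roughly as follows. This is only an illustration, not the script used for the released model: `number_to_words`, `clean_line`, `build_vocab`, and `replace_oov` are hypothetical names, and padding spelled-out numbers with spaces is an assumption.

```python
import re
from collections import Counter

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
SCALES = [(10**9, "billion"), (10**6, "million"), (1000, "thousand"), (100, "hundred")]


def number_to_words(n):
    """Spell out a non-negative integer (hypothetical helper, not from the PR)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    for value, name in SCALES:
        if n >= value:
            head, rest = divmod(n, value)
            words = number_to_words(head) + " " + name
            return words + (" " + number_to_words(rest) if rest else "")


def clean_line(line):
    """Apply the character, number, whitespace, and case rules to one line."""
    # Keep only letters, digits, whitespace, and the apostrophe.
    line = re.sub(r"[^A-Za-z0-9\s']", "", line)
    # Spell out digit runs; the surrounding spaces are an assumption so that
    # "abc123" does not fuse into one token.
    line = re.sub(r"\d+", lambda m: " " + number_to_words(int(m.group())) + " ", line)
    # Squeeze every whitespace run (including a final newline) to one space,
    # then drop leading whitespace only; trailing whitespace is left alone.
    line = re.sub(r"\s+", " ", line).lstrip()
    return line.lower()


def build_vocab(lines, size=400000):
    """Return the `size` most frequent whitespace-separated words."""
    counts = Counter(word for line in lines for word in line.split())
    return {word for word, _ in counts.most_common(size)}


def replace_oov(line, vocab, unk="UNKNOWNWORD"):
    """Replace every word outside the vocabulary with the unknown token."""
    return " ".join(word if word in vocab else unk for word in line.split())
```

With this sketch, `clean_line` would be applied to every line of the corpus first, `build_vocab` would then be computed over the cleaned lines, and `replace_oov` would be applied in a second pass.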
deep_speech_2/README.md
Outdated
* Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
* Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.

Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are pruned by '0 1 1 1 1'. To save disk storage we convert the arpa file to 'trie' binary file with parameters '-a 22 -q 8 -b 8'.
Contributor
'0 1 1 1 1' is hard to follow for readers who are not familiar with kenlm. Could you state which parameter it belongs to and explain it briefly? The same applies to '-a -q -b'.
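To make the question concrete: in KenLM, '0 1 1 1 1' is the value of the `--prune` option of `lmplz`, one count threshold per n-gram order, where n-grams whose count is at or below the threshold are dropped (so unigrams are kept and singleton 2- to 5-grams are pruned). `-a`, `-q`, and `-b` are `build_binary` options for the `trie` format: `-a` sets the maximum number of bits used for pointer compression, while `-q` and `-b` set the quantization bits for probabilities and backoffs. The sketch below only assembles the two command lines; the exact invocation used for the released model is not shown in this thread, the order of 5 is taken from the author's comment, and the function names and file paths are hypothetical.

```python
def lmplz_command(corpus_path, arpa_path, order=5, prune=(0, 1, 1, 1, 1)):
    """Build an argv list for KenLM's lmplz (assumed to be on PATH)."""
    return [
        "lmplz",
        "-o", str(order),                   # maximum n-gram order
        "--prune", *map(str, prune),        # one count threshold per order
        "--text", corpus_path,              # cleaned training corpus
        "--arpa", arpa_path,                # output ARPA file
    ]


def build_binary_command(arpa_path, binary_path, a_bits=22, q_bits=8, b_bits=8):
    """Build an argv list for KenLM's build_binary using the trie format."""
    return [
        "build_binary",
        "-a", str(a_bits),   # max bits for trie pointer compression
        "-q", str(q_bits),   # quantization bits for probabilities
        "-b", str(b_bits),   # quantization bits for backoffs
        "trie",
        arpa_path,
        binary_path,
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the KenLM binaries are installed.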
Here we provide some tips to show how we prepearing our english and mandarin language models.

#### English LM
Contributor
How about "Prepare Your Own English LM", to distinguish this part from the preceding one?
Also add a "#### Download LMs" heading at the very beginning of the preceding part.
xinghai-sun
approved these changes
Sep 19, 2017
fix #294