Add doc for english LM. by pkuyym · Pull Request #295 · PaddlePaddle/models

pkuyym · 2017-09-19T08:45:41Z

pkuyym · 2017-09-19T08:57:46Z

deep_speech_2/README.md

+  * Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
+  * Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.
+
+Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are pruned by '0 1 1 1 1'. To save disk storage we convert the arpa file to 'trie' binary file with parameters '-a 22 -q 8 -b 8'.


Our released language model is pruned by '0 1 1 1 1', and the max order of n-gram is 5.

xinghai-sun · 2017-09-19T08:48:57Z

deep_speech_2/README.md

+
+The english corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our english languge model. There are some preprocessing steps before training:
+
+  * Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.


Remove "which".
Why \s? Is it for whitespace?

xinghai-sun · 2017-09-19T08:50:45Z

deep_speech_2/README.md

+The english corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our english languge model. There are some preprocessing steps before training:
+
+  * Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.
+  * Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.


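--> beginning and trailing?
What is the plural form of "whitespace"?
Should the two occurrences of "lowercase" be singular or plural?

Trailing whitespace has little impact (for example, the newline at the end of a sentence), so there is no need to remove it.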
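For reference, the character filtering, whitespace squeezing, and lowercasing described in the diff could be sketched roughly as below. This is only an illustration, not the script used for the released model: the function name `normalize_line` is hypothetical, the conversion of Arabic numbers to English words is left out, and it assumes that "squeezed to one" means each run of whitespace becomes a single space.

```python
import re

# Anything outside the character class quoted in the README text is dropped.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s']")
# One or more whitespace characters in a row.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_line(line: str) -> str:
    """Hypothetical helper: filter characters, squeeze whitespace, lowercase."""
    kept = _DISALLOWED.sub("", line)
    # Assumption: every whitespace run (including a newline) becomes one space.
    squeezed = _WHITESPACE_RUN.sub(" ", kept)
    # Only leading whitespace is removed; trailing whitespace is kept.
    return squeezed.lstrip().lower()
```

With these assumptions, `normalize_line("  Hello,   World!\n")` gives `"hello world "`: the trailing newline survives as a single space, which matches the point that trailing whitespace does not need to be removed.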

xinghai-sun · 2017-09-19T08:55:27Z

deep_speech_2/README.md

+
+  * Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.
+  * Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
+  * Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.


---> Top 400,000 most frequent words are ....and the rest are replaced with ...
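A minimal sketch of the vocabulary step, assuming whitespace-separated tokens and an in-memory corpus (the real corpus is far too large for this, so treat it as an illustration only). The names `build_vocab` and `replace_oov` are hypothetical; only the 'UNKNOWNWORD' token and the top-N-by-frequency rule come from the diff.

```python
from collections import Counter

UNK = "UNKNOWNWORD"  # replacement token named in the README text


def build_vocab(lines, size):
    """Return the `size` most frequent whitespace-separated words."""
    counts = Counter(word for line in lines for word in line.split())
    return {word for word, _ in counts.most_common(size)}


def replace_oov(line, vocab):
    """Replace every word that is not in `vocab` with the UNK token."""
    return " ".join(word if word in vocab else UNK for word in line.split())
```

For the setup described in the diff, `size` would be 400000. Note that `split()` followed by `join` also collapses whitespace, so this sketch assumes it runs after the whitespace normalization step.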

xinghai-sun · 2017-09-19T08:59:08Z

deep_speech_2/README.md

+  * Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
+  * Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.
+
+Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are pruned by '0 1 1 1 1'. To save disk storage we convert the arpa file to 'trie' binary file with parameters '-a 22 -q 8 -b 8'.


"0 1 1 1 1" is hard to follow for readers who are not familiar with kenlm. Could you say which parameter it is and explain it briefly? Same for "-a -q -b".
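One possible reading of these parameters, sketched as the KenLM command lines they would correspond to. This assumes '0 1 1 1 1' is the value passed to `lmplz --prune` (one count threshold per n-gram order, so unigrams are kept and higher-order n-grams with count 1 or less are dropped), and that `-a`, `-q`, `-b` are the `build_binary` options for trie pointer compression bits, probability quantization bits, and backoff quantization bits. The helper names and file paths are hypothetical, and the code only builds the argument lists without running anything.

```python
def lmplz_command(corpus_path, arpa_path, order=5, prune=(0, 1, 1, 1, 1)):
    """Argument list for estimating a pruned n-gram model with KenLM's lmplz."""
    return [
        "lmplz",
        "-o", str(order),                    # max n-gram order
        "--prune", *(str(p) for p in prune), # one threshold per order (assumed mapping)
        "--text", corpus_path,               # preprocessed corpus
        "--arpa", arpa_path,                 # output ARPA file
    ]


def build_binary_command(arpa_path, binary_path,
                         pointer_bits=22, prob_bits=8, backoff_bits=8):
    """Argument list for converting an ARPA file to a quantized trie binary."""
    return [
        "build_binary",
        "-a", str(pointer_bits),  # max bits for trie pointer compression
        "-q", str(prob_bits),     # bits used to quantize probabilities
        "-b", str(backoff_bits),  # bits used to quantize backoffs
        "trie",
        arpa_path,
        binary_path,
    ]
```

Either list could be passed to `subprocess.run(..., check=True)` on a machine where the KenLM binaries are on the `PATH`.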

xinghai-sun · 2017-09-19T09:01:03Z

deep_speech_2/README.md


+Here we provide some tips to show how we prepearing our english and mandarin language models.
+
+#### English LM


"Prepare Your Own English LM"? To distinguish it from the preceding part?
Also, add "#### Download LMs" at the very beginning of the preceding part.

Add doc for english LM.

35034a3

pkuyym requested a review from xinghai-sun September 19, 2017 08:45

pkuyym commented Sep 19, 2017

xinghai-sun requested changes Sep 19, 2017

xinghai-sun approved these changes Sep 19, 2017

Refine doc.

2cff5b5

pkuyym merged commit e99d6f4 into PaddlePaddle:develop Sep 19, 2017

pkuyym merged 2 commits into PaddlePaddle:develop from pkuyym:fix-294


2 participants

