Conversation
```
schduled_type: is the type of the decay. It supports constant, linear,
exponential, and inverse_sigmoid right now.
a: parameter of the decay (MUST BE DOUBLE)
b: parameter of the decay (MUST BE DOUBLE)
```
Move the comments on lines 12-15 below the init function on line 18, because these three parameters only appear at initialization time.
```
Get the schedule sampling rate. Usually not needed to be called by the users
'''

def getScheduleRate(self):
```
Move the comment on line 33 below line 36, and likewise elsewhere. Otherwise, if someone later inserts a function in between, it will be unclear which function the comment refers to.
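The docstring placement the reviewer asks for can be sketched as follows. The class and parameter names follow the quoted snippets; the method bodies are placeholders, not the PR's actual implementation:

```python
class RandomScheduleGenerator:
    def __init__(self, schedule_type, a, b):
        """
        schedule_type: the type of the decay; supports constant, linear,
                       exponential, and inverse_sigmoid.
        a: parameter of the decay (must be a double)
        b: parameter of the decay (must be a double)
        """
        # The parameter comments live here, right where the
        # parameters first appear.
        self.schedule_type = schedule_type
        self.a = a
        self.b = b

    def getScheduleRate(self):
        """
        Get the schedule sampling rate. Usually not needed to be
        called by the users.
        """
        pass  # placeholder: compute the rate from schedule_type, a, b
```

Keeping each docstring inside the function it documents means the association survives even if someone inserts another function between them.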
lcy-seso
left a comment
Some small modifications.
```python
if __name__ == "__main__":
    schedule_generator = RandomScheduleGenerator("linear", 0.1, 500000)
    true_token_flag = schedule_generator.processBatch(5)
    pdb.set_trace()
```
Please delete the debug-related code.
```
@@ -0,0 +1,56 @@
import numpy as np
import math
import pdb
```
Please remove the debug module.
```python
decoder_state=decoder_mem)

gru_out_memory = paddle.layer.memory(
    name='gru_out', size=target_dict_dim)  # , boot_with_const_id=0)
```
Please remove the useless comment.
lcy-seso
left a comment
Scheduled sampling should not be used in generation. Multiplex layer should only be created in training.
```python
src_embedding = paddle.layer.embedding(
    input=src_word_id,
    size=word_vector_dim,
    param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
```
Since the parameter name _source_language_embedding is not used explicitly anywhere else, it can be removed to avoid such hard-coding.
```python
    return data_reader


def seqToseq_net(source_dict_dim, target_dict_dim, is_generating=False):
```
Please document the parameters as in random_schedule_generator.py.
```python
        input=backward_first)

    def gru_decoder_with_attention_train(enc_vec, enc_proj, true_word,
                                         true_token_flag):
```
Please document the parameters as in random_schedule_generator.py.
```python
        return out

    def gru_decoder_with_attention_test(enc_vec, enc_proj, current_word):
```
Please document the parameters as in random_schedule_generator.py.
```python
    param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))

current_word = paddle.layer.multiplex(
    input=[true_token_flag, true_word, generated_word_emb])
```
This layer should not be created in generation, because in generation the generated word is always used.
The multiplex layer is inside the function gru_decoder_with_attention_train, which is only called during training.
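As a plain-NumPy sketch of what this selection does (a simplified model of the behavior discussed in the thread, not Paddle's implementation; the helper name `multiplex` is ours): the first input carries one integer id per row, and each output row is copied from the candidate tensor that id selects, so a flag of 0 picks the ground-truth word and 1 picks the generated word.

```python
import numpy as np

def multiplex(ids, *candidates):
    """Row-wise selection: out[i] = candidates[ids[i]][i]."""
    out = np.empty_like(candidates[0])
    for i, k in enumerate(ids):
        out[i] = candidates[k][i]
    return out

true_word = np.array([[1.0], [2.0], [3.0]])      # ground-truth embeddings
generated = np.array([[10.0], [20.0], [30.0]])   # generated-word embeddings
flags = np.array([0, 1, 0])                      # 0 -> true word, 1 -> generated
mixed = multiplex(flags, true_word, generated)
```

In generation there is nothing to select between (the generated word is always fed back), which is why the layer only belongs in the training branch.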
```python
    size=target_dict_dim,
    embedding_name='_target_language_embedding',
    embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
```
In generation, the target embedding is unknown, so this configuration is not reasonable.
The type of trg_embedding is GeneratedInputV2. It shares the target-language embedding matrix with the one used during training; it does not use ground-truth target words as inputs.
I have revised the comments and added the documentation in README.md.
```python
"""
The decoder step for training.
:param enc_vec: the encoder vector for attention
:type enc_vec: Layer
:param enc_proj: the encoder projection for attention
:type enc_proj: Layer
:param true_word: the ground-truth target word
:type true_word: Layer
:param true_token_flag: the flag of using the ground-truth target word
:type true_token_flag: Layer
:return: the softmax output layer
:rtype: Layer
"""
```
scheduled_sampling/README.md (outdated)
```
- Inverse sigmoid decay: `epsilon_i=k/(k+exp(i/k))`, where `k>1` and `k` likewise controls the magnitude of the decay.

## Model Implementation
Since Scheduled Sampling is an improvement on the sequence-to-sequence model, its overall implementation framework is quite similar to that of the sequence-to-sequence model. To keep this article focused, only the parts related to Scheduled Sampling are described here; see `scheduled_sampling.py` for the complete code.
```
The parts related to scheduled sampling include:
- how the sampling probability decays
- how the multiplex layer is used

Both need to be explained. What is the principle for setting the hyperparameters of the functions that produce the sampling probability?
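To make the decay concrete, here is a sketch of the four schedules the docstring names. The exact parameterization in the PR's random_schedule_generator.py may differ; the inverse-sigmoid formula follows the README's `epsilon_i = k/(k+exp(i/k))` with `k = b`, while the roles of `a` and `b` in the other schedules are our assumptions:

```python
import math

# epsilon(d): probability of feeding the ground-truth word after d
# training instances. All four are non-increasing in d.
schedules = {
    # always use ground truth with probability a
    "constant": lambda a, b, d: a,
    # decay linearly from 1.0, never dropping below the floor a
    "linear": lambda a, b, d: max(a, 1.0 - d / b),
    # decay as a**(d/b), assuming 0 < a < 1
    "exponential": lambda a, b, d: a ** (d / b),
    # README formula epsilon_i = k/(k + exp(i/k)) with k = b > 1
    "inverse_sigmoid": lambda a, b, d: b / (b + math.exp(d / b)),
}

# e.g. the PR's linear schedule with a=0.75, b=1000000:
rate_start = schedules["linear"](0.75, 1000000, 0)        # starts at 1.0
rate_late = schedules["linear"](0.75, 1000000, 2000000)   # clipped at the floor 0.75
```

Under this reading, `a` sets the floor (or base) of the decay and `b` sets its time scale, which is the kind of explanation the review asks to be spelled out in the README.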
Here the data reader is wrapped so that `true_token_flag`, sampled from `RandomScheduleGenerator`, is added as another data input that controls which element is used for decoding.

```python
schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
```
How were the two values 0.75 and 1000000 chosen? Please explain this in the README; otherwise users can hardly tell where these settings come from.

The text above mentions that the hyperparameters need to be tuned by the user. These two values will be replaced after tuning, with a note that they are the tuned results.
```python
    indexes = (numbers >= rate).astype('int32').tolist()
    self.data_processed_ += batch_size
    return indexes
```
Pasting a code snippet like this is no better than reading the code directly; please explain how to use it and how to set the initial parameters.
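To illustrate how the quoted lines are used, a standalone sketch (assuming, as in the snippet, that `rate` is the current sampling rate and that draws come from NumPy's uniform sampler over [0, 1)):

```python
import numpy as np

def process_batch(rate, batch_size, rng=np.random):
    """Draw one 0/1 flag per batch element.

    0 -> feed the ground-truth word at this step,
    1 -> feed the previously generated word.
    The higher the rate, the more ground-truth words are used.
    """
    numbers = rng.random_sample(batch_size)
    return (numbers >= rate).astype('int32').tolist()

always_true = process_batch(1.0, 5)  # rate 1.0: every flag is 0 (ground truth)
always_gen = process_batch(0.0, 5)   # rate 0.0: every flag is 1 (generated)
```

Each draw below the rate keeps the ground-truth word, so as the rate decays over training, generated words are fed back more and more often.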
scheduled_sampling/README.md (outdated)
```python
    self.data_processed_ += batch_size
    return indexes
```
The `__init__` method defines several different decay probabilities, and the `processBatch` method samples according to that probability, ultimately deciding whether the true element or the generated element is used for decoding.
Pasting a code snippet and appending a sentence like this is not an effective explanation; it is no different from reading the code directly, and it leaves the reader full of questions.

- "The `__init__` method defines several different decay probabilities" → defines several what? How should one choose among them? How should the parameters be set? Please connect this properly with the introduction above; the reference is unclear.
- "the `processBatch` method samples according to that probability" → does "that probability" refer to the one defined in `__init__` in the previous sentence? `__init__` accepts hyperparameters; how does the sampling probability change?
- "ultimately deciding whether the true element or the generated element is used for decoding" → decided how?
scheduled_sampling/README.md (outdated)
```
The `__init__` method defines several different decay probabilities, and the `processBatch` method samples according to that probability, ultimately deciding whether the true element or the generated element is used for decoding.

Here the data reader is wrapped so that `true_token_flag`, sampled from `RandomScheduleGenerator`, is added as another data input that controls which element is used for decoding.
```
- "Here the data reader is wrapped" → please expand this by a couple of sentences. Why is the reader wrapped? Please don't make readers figure it out themselves.
- "controls which element is used for decoding" → no "decoding" process is involved here; generating the whole sequence is usually what is called decoding.

Please submit a PR for the v2 API of the multiplex layer; otherwise this example will not run after being merged.
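What "wrapping the reader" amounts to can be sketched generically. The reader protocol here, a function returning an iterable of sample tuples, follows Paddle v2's reader convention; the tuple layout and the `schedule_generator` interface are assumptions for illustration:

```python
def gen_schedule_data(reader, schedule_generator):
    """Wrap an existing reader so that each sample also carries the
    sampled true_token_flag used to select the decoder input."""
    def data_reader():
        for src_ids, trg_ids, trg_ids_next in reader():
            # one 0/1 flag per target position:
            # 0 -> ground-truth word, 1 -> generated word
            flags = schedule_generator.processBatch(len(trg_ids))
            yield src_ids, trg_ids, trg_ids_next, flags
    return data_reader

# Tiny demo with a stand-in generator that always picks ground truth.
class _AlwaysTrue:
    def processBatch(self, n):
        return [0] * n

wrapped = gen_schedule_data(lambda: [([1, 2], [3, 4, 5], [4, 5, 6])], _AlwaysTrue())
sample = next(iter(wrapped()))
```

The wrapped reader yields a fourth field per sample, which the data feeder can map to the true_token_flag input layer during training.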
```
@@ -1 +1,164 @@
TBD
# Scheduled Sampling
```
The title should be changed to Chinese; likewise for all occurrences of "Scheduled Sampling".

This example does not need a standard Chinese translation; English is fine. I have not yet come across a widely accepted Chinese translation.
scheduled_sampling/README.md (outdated)
```
## Overview
The training objective of a sequence generation task is to maximize the probability of the target sequence given the source input. During training, the model takes the true elements of the target sequence as the input at each decoding step and maximizes the probability of the next element. During generation, the element decoded in the previous step is used as the current input when generating the next element. Clearly, the probability distribution of the decoder's input data differs between training and generation. If a wrong element is generated early in the sequence, later input states are affected, and the error keeps accumulating as generation proceeds.
Scheduled Sampling is a method for resolving this mismatch between the input-data distributions at training and generation time. Early in training it mainly uses the true elements as decoder inputs, quickly guiding the model from its randomly initialized state to a reasonable state. As training proceeds, it gradually uses more generated elements as decoder inputs, to resolve the distribution mismatch.
```
scheduled_sampling/README.md (outdated)
```
# Scheduled Sampling

## Overview
The training objective of a sequence generation task is to maximize the probability of the target sequence given the source input. During training, the model takes the true elements of the target sequence as the input at each decoding step and maximizes the probability of the next element. During generation, the element decoded in the previous step is used as the current input when generating the next element. Clearly, the probability distribution of the decoder's input data differs between training and generation. If a wrong element is generated early in the sequence, later input states are affected, and the error keeps accumulating as generation proceeds.
```
- If there is a training objective, the generation objective should also be stated. Alternatively, both objectives can be left out; I suggest removing this part and only discussing the different data distributions at training and generation time.
- Is "If a wrong element is generated early in the sequence, ..." the reason for introducing Scheduled Sampling? If not, it can be removed.
scheduled_sampling/README.md (outdated)
```
## Overview
Scheduled Sampling is a method for resolving the mismatch between the input-data distributions at training and generation time. Early in training it mainly uses the true elements as decoder inputs, quickly guiding the model from its randomly initialized state to a reasonable state. As training proceeds, it gradually uses more generated elements as decoder inputs, to resolve the distribution mismatch.
```
- "Early in training it mainly uses the true elements as decoder inputs": "the true elements" should be "the true elements of the target sequence".
- Change "以将" to "可以将".
- Write "随着训练的进行,该方法XXX" with a comma after the opening clause (mind sentence breaks throughout the text).
scheduled_sampling/README.md (outdated)
```
## Algorithm Overview
Scheduled Sampling is mainly applied to the training of Sequence to Sequence models; it is not needed in the generation stage.
```
- Suggested wording: "主要应用在序列到序列模型的训练阶段,生成阶段不需要使用。" (mainly applied in the training stage of sequence-to-sequence models; not needed in the generation stage).
- Change "Sequence to Sequence" to "序列到序列" throughout.
```
## Algorithm Overview
```
The algorithm overview would be best accompanied by a figure; the current description will leave novice users quite confused.
If you are not going to finish this PR, please tell me, and I will do it myself.
I will finish it ASAP. Sorry for the delay. @lcy-seso
@wwhu You're welcome. I think the work is almost finished; after some small modifications we can merge it first and then keep refining it. Thanks for your work.
lcy-seso
left a comment
Need to fix a small bug due to the updates of PaddlePaddle.
```
@@ -1 +1,164 @@
TBD
# Scheduled Sampling
```
This example does not need a standard Chinese translation; English is fine. I have not yet come across a widely accepted Chinese translation.
```python
    return cost
else:
    trg_embedding = paddle.layer.GeneratedInputV2(
```
There is a small issue here: Paddle was recently upgraded, and the V2 suffix of GeneratedInputV2 and StaticInputV2 is no longer needed. Please replace them all with GeneratedInput and StaticInput; otherwise an error will be raised.
lcy-seso
left a comment
Almost LGTM, I will further refactor and validate this demo.
resolve #11
Note: This model may encounter a "Floating point exception" after training on several mini-batches.
Scheduled sampling needs to use the API multiplex_layer (PaddlePaddle/Paddle#1753), which has not been implemented in the current Paddle version. I implemented this layer in my repository (https://github.com/wwhu/Paddle/blob/ss-dev/python/paddle/trainer_config_helpers/layers.py). I will post a PR to the official Paddle repository after I write the unit test for it.