CTR demo #57

Merged
lcy-seso merged 36 commits into PaddlePaddle:develop from Superjomn:develop
Jun 1, 2017
Conversation

@Superjomn
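Contributor

The data-processing part is too long, so I wrote it up in a separate markdown file.

The generate script will be added later.

The docs were written in org mode and then converted to .md, so there are .org files; they can be tucked away in a separate directory.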

@lcy-seso
Collaborator

lcy-seso left a comment

Preliminary comments on formatting.

ctr/README.org Outdated
@@ -0,0 +1,178 @@
#+title: CTR prediction with the Wide & Deep neural model
Collaborator

The markdown will be converted to HTML later, so please delete the org file here.

ctr/README.md Outdated

<a id="org8f6a6fa"></a>

# Citations
Collaborator

  1. Use a second-level heading, `## 参考文献` (References); for now each article keeps only one first-level heading.
  2. Write the references as a plain numbered list, without the square brackets. Where a reference is cited in the text, use a marker like [1].
  3. Please include links for the papers as well.
ctr/README.md Outdated

The figure below shows the structure of LR and of a \(3x2\) NN model:

![img](./images/lr-vs-dnn.jpg)
Collaborator

ctr/README.md Outdated

<a id="orgab346e7"></a>

# Data and task abstraction
Collaborator

Each article has only one first-level heading; change this one to a second-level heading.

ctr/README.md Outdated


<a id="orgc299c2a"></a>

Collaborator

First-level heading: `# 点击率预估` (Click-through rate prediction); the sections after it should use second-level, third-level, and lower headings.

            act=paddle.activation.Relu(),
            name='dnn-fc-%d' % no)
        _input_layer = fc
    return _input_layer
Collaborator

Remove the extra blank lines at 172 ~ 173.

ctr/README.md Outdated

```

<a id="orgb4020a9"></a>
Collaborator

Remove these tags from the markdown for now; the HTML will be rendered uniformly later.
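
For reference, a minimal sketch of how the anchors left behind by the org-mode export could be stripped in one pass. It assumes every anchor sits on its own line and has the form `<a id="org…"></a>` with a hexadecimal suffix, as in the ones quoted in this thread; `strip_org_anchors` is a hypothetical helper, not part of this PR.

```python
import re

# Anchors emitted by the org-mode export, e.g. <a id="orgb4020a9"></a>.
# Assumption: each one occupies a whole line and the suffix is hexadecimal.
ORG_ANCHOR = re.compile(r'^<a id="org[0-9a-f]+"></a>\n', re.MULTILINE)


def strip_org_anchors(markdown_text):
    """Return the markdown with every exported org anchor line removed."""
    return ORG_ANCHOR.sub('', markdown_text)


sample = '<a id="orgb4020a9"></a>\n## Title\n\nSome text.\n'
cleaned = strip_org_anchors(sample)
```

Other HTML in the file, such as a centered figure block, is left untouched because the pattern only matches the `org…` ids.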


params = paddle.parameters.create(classification_cost)

optimizer = paddle.optimizer.Momentum(momentum=0)
Collaborator

Won't this cause problems without paddle.init()?

ctr/dataset.md Outdated
- `C14-C21` &#x2013; anonymized categorical variables


<a id="orgeaf74d5"></a>
Collaborator

Please remove the HTML tags for now.

ctr/dataset.org Outdated
return res
#+END_SRC


Collaborator

Make the references a separate section with a second-level heading.

@lcy-seso
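Collaborator

lcy-seso commented May 26, 2017

  1. The repository currently stores only markdown, consistent with the other projects; please delete the org files first.
  2. The markdown will later be converted automatically to HTML, Jupyter notebooks, and other formats, so use native markdown syntax wherever possible.
  3. Run pre-commit to format the files; otherwise the Travis CI check will not pass.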
Collaborator

lcy-seso left a comment

Some suggested changes to the documentation.

ctr/README.md Outdated

The figure below shows the structure of LR and of a \(3x2\) NN model:

![img](./images/lr-vs-dnn.jpg)
Collaborator

  1. The figure is not centered.
  2. The figure caption is missing.
  3. For image file names, use "_" instead of "-" throughout, consistent with the other examples in the repo: "lr-vs-dnn.jpg" --> "lr_vs_dnn.jpg"
  4. To stay consistent with the other examples, use the following markup:


Figure 1. ×

ctr/README.md Outdated

![img](./images/lr-vs-dnn.jpg)

The blue-arrow part of LR maps directly onto the corresponding structure in the NN; LR and NN clearly have some things in common (for example, the weighted sum),
Collaborator

Wouldn't it be better to change NN to DNN?

ctr/README.md Outdated

### LR vs DNN

The figure below shows the structure of LR and of a \(3x2\) NN model:
Collaborator

Wouldn't DNN be more appropriate than NN here? The term NN has not appeared anywhere in the text above.

ctr/README.md Outdated
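The blue-arrow part of LR maps directly onto the corresponding structure in the NN; LR and NN clearly have some things in common (for example, the weighted sum),
but at the same input dimension the former's model complexity can be much lower than the latter's (in a sense, the more complex a model is, the more potential it has to learn more complex information).

For LR to match the learning capacity of the NN, the input dimension must be increased, that is, the number of features must be increased,
Collaborator

NN --> DNN. The text above introduces DNN but never mentions NN, which will confuse readers.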

ctr/README.md Outdated
We can take `click` as the learning target; the task itself can be set up in one of the following ways:

1. Learn click directly, treating 0 and 1 as binary classification
2. Learning to rank, specifically pairwise rank (label 1 > 0) or list rank
Collaborator

list --> listwise
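
To make the difference between the two formulations quoted above concrete, here is a minimal sketch of the training samples each one would produce. The toy impressions and both helper names are hypothetical and not part of this PR; in practice pairs would normally be formed only among impressions that share a context such as the same query or session.

```python
from itertools import product


def make_binary_samples(impressions):
    """Formulation 1: each impression is one sample labelled 0 or 1."""
    return [(features, click) for features, click in impressions]


def make_pairwise_samples(impressions):
    """Formulation 2 (pairwise rank): pair every clicked impression with
    every non-clicked one, so a model can learn that label 1 > label 0."""
    clicked = [f for f, click in impressions if click == 1]
    skipped = [f for f, click in impressions if click == 0]
    return list(product(clicked, skipped))


# Hypothetical toy impressions: (feature id, click label).
impressions = [("ad_a", 1), ("ad_b", 0), ("ad_c", 0)]
binary_samples = make_binary_samples(impressions)
pairwise_samples = make_pairwise_samples(impressions)
```

The binary form keeps one sample per impression, while the pairwise form grows with the product of clicked and non-clicked counts.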

ctr/dataset.md Outdated

### Categorical features

A categorical feature takes one of a finite set of values; in the model we usually use an embedding table to map each value to a vector of continuous values.
Collaborator

Embedding table --> Embedding

    def __repr__(self):
        return '<CategoryFeatureGenerator %d>' % len(self.dic)
```

Collaborator

Could you add a sentence or two describing how users should use this class to process data?

Contributor Author

done
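
As an illustration of the kind of usage description requested here, a minimal sketch follows. Only the class name, `self.dic`, and `__repr__` appear in the quoted diff; the `register`, `gen`, and `size` methods and the `'unk'` fallback entry are assumptions made for this example and may differ from the code in the PR.

```python
class CategoryFeatureGenerator(object):
    """Maps each distinct category value to an integer id (sketch)."""

    def __init__(self):
        # 'unk' is a hypothetical fallback id for values never registered.
        self.dic = {'unk': 0}

    def register(self, key):
        """Call once per value while scanning the data set."""
        if key not in self.dic:
            self.dic[key] = len(self.dic)

    def gen(self, key):
        """Return the id of a value, or the fallback id if it is unseen."""
        return self.dic.get(key, self.dic['unk'])

    def size(self):
        return len(self.dic)

    def __repr__(self):
        return '<CategoryFeatureGenerator %d>' % len(self.dic)


# First pass: register every value seen; second pass: generate ids.
generator = CategoryFeatureGenerator()
for value in ['1', '0', '1', '4']:
    generator.register(value)

ids = [generator.gen(value) for value in ['0', '4', '5']]
```

The two-pass shape is the point of the sketch: values must be registered before ids are generated, and anything unseen falls back to a single shared id.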


    def size(self):
        return self.max_dim
```
Collaborator

Could you add a sentence or two describing how users should use this class to process data?

Contributor Author

done


    def size(self):
        return self.max_dim
```
Collaborator

Could you add a sentence or two describing how users should use this class to process data?

Contributor Author

done

ctr/dataset.md Outdated

## Feeding into PaddlePaddle

Both the Deep and Wide parts take input in the `sparse_binary_vector` format[1]; the relevant features must be concatenated before input, and the model ultimately accepts only 3 inputs,
Collaborator

Please change the reference marker to: [1]
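
The concatenation step described in the quoted text can be sketched as follows. This assumes a sparse binary vector is supplied as the list of its non-zero indices; `concat_sparse_fields` and the field sizes are hypothetical and only illustrate the offset arithmetic.

```python
def concat_sparse_fields(field_ids, field_sizes):
    """Merge per-field one-hot ids into one list of active indices.

    field_ids[i] is the active id within field i and field_sizes[i] is that
    field's dimension. Each field is shifted by the total size of the fields
    before it, so the result indexes into a single concatenated vector.
    """
    indices = []
    offset = 0
    for fid, size in zip(field_ids, field_sizes):
        if not 0 <= fid < size:
            raise ValueError(
                'id %d out of range for field of size %d' % (fid, size))
        indices.append(offset + fid)
        offset += size
    return indices


# Hypothetical record with three fields of sizes 4, 3 and 5.
active = concat_sparse_fields([2, 0, 4], [4, 3, 5])
```

The dimension of the concatenated vector is the sum of the field sizes, 12 in this example.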

Collaborator

lcy-seso left a comment

A few minor comments on the documentation.

ctr/README.md Outdated

### Model overview

The Wide & Deep Learning Model[3] can be used as a relatively mature model framework,
Collaborator

The citation format for references is still not quite right. For example, the 3 here:

\[[3](#参考文献)\]
ctr/README.md Outdated

We use the first approach directly and treat the problem as a classification task.

We use the dataset from the `Click-through rate prediction` task on Kaggle[\[2\]](https://www.kaggle.com/c/avazu-ctr-prediction/data) to demonstrate the model.
Collaborator

The citation format for references is still not quite right. For example, the [2] here:

\[[2](#参考文献)\]
ctr/README.md Outdated

## Background

CTR (Click-Through Rate)[\[1\]](https://en.wikipedia.org/wiki/Click-through_rate) denotes the probability that a user clicks on a particular link,
Collaborator

The citation format for references is still not quite right. For example, the [1] here:

\[[1](#参考文献)\]
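
Since the same marker fix recurs across several comments, here is a minimal sketch of how it could be applied mechanically. It assumes the old markers always have the shape `[\[n\]](url)` seen in the quoted lines; `fix_citations` is a hypothetical helper, not something in this PR.

```python
import re

# Matches old markers such as [\[2\]](https://example.com), capturing the number.
OLD_CITATION = re.compile(r'\[\\\[(\d+)\\\]\]\([^)]*\)')


def fix_citations(text, anchor='#参考文献'):
    """Rewrite old-style markers so they link to the references section."""
    return OLD_CITATION.sub(
        lambda m: '\\[[%s](%s)\\]' % (m.group(1), anchor), text)


old = (r'CTR (Click-Through Rate)'
       r'[\[1\]](https://en.wikipedia.org/wiki/Click-through_rate)')
new = fix_citations(old)
```

Markers that already use the requested form are not matched, so running the helper twice leaves the text unchanged.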
ctr/README.md Outdated

<p align="center">
<img src="images/lr_vs_dnn.jpg" width="620" hspace='10'/> <br/>
Figure 1. LR 和DNN模型结构对比 (Comparison of the LR and DNN model structures)
Collaborator

Keep the style consistent: add a space both before and after DNN.

ctr/README.md Outdated

## Data and task abstraction

We can take `click` as the learning target; the task itself can be set up in one of the following ways:
Collaborator

具体任务可以有以下几种方案: --> 具体的,任务可以有以下几种方案: (that is, reword "the task itself can be set up in one of the following ways:" as "specifically, the task can be set up in one of the following ways:")

feeding=field_index,
event_handler=event_handler,
num_passes=100)
```
Collaborator

  1. Add a section: `## 运行训练和测试` (Running training and testing)
  2. Give a brief step-by-step description of how a user who has cloned this repo should run the scripts in this example, covering for instance:
    • Which script to run first to download the data or prepare the environment.
    • Which script launches the training job, and whether any parameters need to be changed.
    • Which script is responsible for reading the data, and which script to modify in order to feed in one's own data.

Contributor Author

done

ctr/dataset.md Outdated
2. newid = id % N
3. Use newid as a categorical feature

Although the method above carries some probability of collisions, it can handle any number of ID features while retaining a reasonable level of effectiveness[2].
Collaborator

The reference marker still has a problem:

\[[2](#参考文献)\]
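
A minimal sketch of the `newid = id % N` step quoted above. The first item of the quoted list is not visible in this excerpt, so the sketch assumes it turns the raw ID into an integer; the class name `HashedIdFeature` and the use of `zlib.crc32` are choices made for this example only.

```python
import zlib


class HashedIdFeature(object):
    """Hypothetical helper: folds arbitrary ids into max_dim buckets."""

    def __init__(self, max_dim):
        self.max_dim = max_dim

    def gen(self, raw_id):
        # Assumed first step: map the raw id to an integer. crc32 is used
        # because, unlike the built-in hash() on strings, it returns the
        # same value on every run.
        hashed = zlib.crc32(str(raw_id).encode('utf-8')) & 0xffffffff
        # Quoted step: newid = id % N
        return hashed % self.max_dim

    def size(self):
        return self.max_dim


feature = HashedIdFeature(max_dim=1000)
new_ids = [feature.gen(raw) for raw in ['device_a', 'device_b', 'device_a']]
```

The same raw ID always lands in the same bucket, and two different IDs may share one, which is the collision the quoted text refers to.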
ctr/dataset.md Outdated

`CategoryFeatureGenerator` must first scan the dataset to obtain the set of items for the category before it can start generating features.

Our experimental dataset[\[3\]](https://www.kaggle.com/c/avazu-ctr-prediction/data) has already been shuffled, so we can scan a certain number of records from the beginning to approximate the full set of category items (equivalent to random sampling),
Collaborator

The marker for reference 3 has a problem.

\[[3](#参考文献)\]
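
The prefix scan described in the quoted text might look like the sketch below. The in-memory CSV, the helper name `scan_categories`, and the record limit are all illustrative assumptions; the real script reads the downloaded file.

```python
import csv
import io
from itertools import islice


def scan_categories(lines, field, max_records):
    """Collect the distinct values of `field` from the first max_records rows."""
    reader = csv.DictReader(lines)
    return {row[field] for row in islice(reader, max_records)}


# Hypothetical in-memory stand-in for the shuffled CSV file.
data = io.StringIO(
    "click,device_type\n"
    "0,1\n"
    "1,0\n"
    "0,1\n"
    "0,4\n"
)
seen = scan_categories(data, 'device_type', max_records=3)
```

Because only a prefix is scanned, a value that first appears later (here `4`) is missed, so feature generation needs some fallback for unseen values.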
Collaborator

lcy-seso left a comment

Please add a script that downloads the data automatically.

## Running training and testing
Training the model takes the following steps:

1. Download the training data; you can use the data from the CTR competition on Kaggle\[[2](#参考文献)\]
Collaborator

  1. Following sequence_tagging_for_ner under models, add a data folder to this example and, inside it, a script that fetches the data: https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/data/download.sh
  2. Add a main function to train.py; in main, specify the default arguments for the four functions mentioned in the usage.
  3. End result: the user first runs the data download script and then runs train.py, and the training job starts directly.
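
One way the requested main function could be shaped is sketched below. Every option name and default value is a hypothetical placeholder (only `num_passes=100` echoes the quoted training call); the actual usage options of train.py are not shown in this thread.

```python
import argparse


def parse_args(argv=None):
    """Command-line options with defaults, so the script runs with no flags.

    All option names and default values here are hypothetical placeholders.
    """
    parser = argparse.ArgumentParser(description='CTR demo training (sketch)')
    parser.add_argument('--train_data_path', default='./data/train.txt')
    parser.add_argument('--batch_size', type=int, default=10000)
    parser.add_argument('--num_passes', type=int, default=100)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # A real script would build the model and start training here.
    print('training on %s for %d passes'
          % (args.train_data_path, args.num_passes))
    return args


# An empty list means "use every default"; a real script would call main().
args = main([])
```

With defaults in place, running the script without arguments is enough to start, and any single option can still be overridden on the command line.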
@lcy-seso
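Collaborator

lcy-seso commented Jun 1, 2017

The Kaggle dataset cannot be downloaded directly by a script. Please revise the README and add a step-by-step walkthrough of how to feed the raw data to the train.py script and launch the training job:

  • What kind of file you get after downloading the raw data.
  • What processing is needed (for example, decompression)?
  • Add a default main function to train.py so that it can be run directly.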
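
For the decompression step mentioned above, a minimal sketch, assuming the download is a gzip archive. The file names `train.gz` and `train.csv` are placeholders, since the thread does not state what the raw download looks like; the demonstration builds its own tiny archive so that it is self-contained.

```python
import gzip
import os
import shutil
import tempfile


def gunzip(src_path, dst_path):
    """Decompress a .gz file to dst_path without loading it all into memory."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Self-contained demonstration with a tiny stand-in archive.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, 'train.gz')
with gzip.open(archive, 'wb') as f:
    f.write(b'click,device_type\n0,1\n')

extracted = gunzip(archive, os.path.join(workdir, 'train.csv'))
with open(extracted, 'rb') as f:
    content = f.read()
```

Streaming with `shutil.copyfileobj` keeps memory use flat, which matters for a large training file.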
@reyoung
Collaborator

reyoung commented Jun 1, 2017

It seems a virtualenv issue caused this unit test to fail.

The related issue is here.

Collaborator

lcy-seso left a comment

LGTM

@lcy-seso lcy-seso merged commit 1af0222 into PaddlePaddle:develop Jun 1, 2017