Modify path format and file format #2083
Conversation
> We select CephFS to store our data.
> We choose CephFS as the storage service for the training data.
> From the perspective of user program running in a Pod, it is only I/O with the local filesystem, as
It would be better to change "it is only I/O with the local filesystem" to "it is mounted locally", because I/O through Ceph is generally network I/O, not local filesystem I/O.
> ```python
> # ...
> reader = paddle.reader.creator.SSTable("/home/random_images-*-of-*")
> reader = paddle.reader.creator.RecordIO("/home/random_images-*-of-*")
> ```
This could be changed to /home/user_name/random_images-*-of-*.
> ```bash
> paddle cp filenames pfs://home/folder/
> ```
> ```bash
> paddle pfs cp filenames /pfs/folder/
> ```
Here and below this should be changed to /pfs/$DATACENTER/home/$USER/folder/.
Before that, we need to explain that /pfs/ denotes remote storage and $DATACENTER/ denotes which remote datacenter. A user starts a job in a particular datacenter, so the program can only see that datacenter's directory: /home/$USER/ is mounted locally.
@helinwang If we have to include $DATACENTER, having $DATACENTER show up inside the path is confusing. People reading this code later will wonder: "Why is $DATACENTER a level of the directory? What was the reasoning back then?"
Referring to S3's API design, each datacenter has an API "endpoint" (a DNS address) that the client talks to, and the paths of files a user uploads and stores all start from "/". Likewise, $USER should be configured in a separate place.
My idea: we need a config file that stores the user's credentials and the default datacenter address:
```
[pfs]
username=wuyi
usercert=wuyi.pem
userkey=wuyi-key.pem
endpoint=datacenter1.paddlepaddle.org
```
The commands to invoke would be:
```bash
paddle pfs upload myfile /
paddle pfs download /myfile
paddle pfs mv /myfile /myfile2
...
```
Agreed on putting the datacenter in the config!
Discussed this with @wangkuiyi; his suggestion is:
```
# config file
[datacenter_1]
username=wuyi
usercert=wuyi.pem
userkey=wuyi-key.pem
endpoint=datacenter1.paddlepaddle.org
[datacenter_2]
username=wuyi
usercert=wuyi.pem
userkey=wuyi-key.pem
endpoint=datacenter2.paddlepaddle.org
```
To upload:
```bash
# do not have permission to /pfs/datacenter_2/home/john/
paddle pfs cp file.txt /pfs/datacenter_2/home/david/
```
/pfs/datacenter_2 is used to differentiate local from remote.
In train.py, one can reference the file by /home/david/file.txt.
We need the home directory because in the future we may want John's directory to be visible to David.
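As a thought experiment, the multi-datacenter config above could be resolved against a remote path like this. This is only a sketch; `endpoint_for` is a hypothetical helper, not part of the actual paddle CLI.

```python
# Sketch: resolve the API endpoint for a /pfs/<datacenter>/... path from the
# proposed multi-datacenter config. Function name is illustrative only.
import configparser

CONFIG = """
[datacenter_1]
username=wuyi
usercert=wuyi.pem
userkey=wuyi-key.pem
endpoint=datacenter1.paddlepaddle.org
[datacenter_2]
username=wuyi
usercert=wuyi.pem
userkey=wuyi-key.pem
endpoint=datacenter2.paddlepaddle.org
"""

def endpoint_for(pfs_path, config_text=CONFIG):
    """Return the endpoint for a remote path like /pfs/datacenter_2/home/david/."""
    parts = pfs_path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "pfs":
        raise ValueError("not a remote /pfs/ path: %s" % pfs_path)
    cfg = configparser.ConfigParser()
    cfg.read_string(config_text)
    return cfg[parts[1]]["endpoint"]  # parts[1] is the datacenter section name

print(endpoint_for("/pfs/datacenter_2/home/david/file.txt"))
# datacenter2.paddlepaddle.org
```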
It is pretty common that a user might need to access more than one cluster. For example, I need to access the Key State Lab's cluster and the 3- or 4-node cluster @typhoonzero set up for development and demonstrations.
If we use the file path to distinguish cloud directories, then depending on the combination of arguments, paddle pfs cp can cover four cases: local to local, local to remote, remote to local, and remote to remote. Covering those four cases otherwise might require four different commands; the former looks simpler.
Wouldn't commands like cp and mv also be confusing? Because:
- cp local to remote == upload (with an overwrite option)
- cp remote to local == download
- cp remote to remote == paddle pfs cp
- cp local to local == cp
...
As for file sharing, we can still avoid using the /home directory, to simplify things for users: when a user shares a file with someone, a link containing a token is generated, e.g. https://pfs.paddlepaddle.org/[somehash]. The recipient runs paddle pfs add https://pfs.paddlepaddle.org/[somehash] /John_data to create a symbolic link in their own directory pointing to the shared data.
> 1. some shared directories, e.g., the pre-downloaded `paddle.v2.dataset` data, should have been mapped to the Pod-local directory `/common`.
> The public directory in the CephFS storage system needs to hold some preset public datasets (e.g., MNIST, BOW, ImageNet) that submitted jobs can use directly.
> and from the perspective of our client tool `paddle`, it has to refer to files in the distributed filesystem in a special format, just like `/pfs/$DATACENTER/home/$USER/cifa/...`.
After mounting there is no /pfs/$DATACENTER/ anymore, only /home/$USER/cifa/...; /pfs/$DATACENTER/ is the parameter namespace the paddle pfs command uses to denote remote storage.
This is from the perspective of the paddle client tool. :)
@gongweibao Can we keep it consistent and also place it under /pfs/$DATACENTER/ inside the Pod?
> We select CephFS to store our data.
> We choose CephFS as the storage service for the training data.
> From the perspective of user program running in a Pod, it is mounted locally, as
Kubernetes will mount the CephFS path into the Pod in which the user's code runs.
> From the perspective of user program running in a Pod, it is mounted locally, as
> Different computing frameworks running on Kubernetes can mount storage into each container via a Volume or PersistentVolume.
> 1. the home directory should have been mapped to the Pod-local directory `/home`, and
It should not be mounted at the Pod's /home directory; by convention Linux uses /home as the root of users' home directories. Something like /data should be fine, right?
@Yancey1989 It does seem like it is being used as /home?
Also, shouldn't it be /home/$USER?
/home/$USER should not be used:
- The user that runs the program in the Pod's container is the container's root user. The user inside the container is not the same as the Cloud user.
- The Pod container only provides an execution environment; the mounted directory should not affect the container's existing filesystem layout.
- /home/$USER would also hold configs such as .bashrc, which already exist in the container's storage layer by default. There is no need to keep them in storage and create fragment files; storage should focus on data only.
Uniformly use /pfs/$DATACENTER/home/$USER to access data.
> ### File preprocessing
> Before a dataset can be trained on, the files need to be converted into the PaddlePaddle cluster's internal storage format (SSTable). We provide two conversion methods:
> Before a dataset can be trained on, the files need to be converted into the PaddlePaddle cluster's internal storage format [RecordIO](https://github.com/PaddlePaddle/Paddle/issues/1947). We provide two conversion methods:
Suggest rewording "Before a dataset can be trained on" to: "Before training starts, the dataset needs to be converted into the storage format used by PaddlePaddle distributed training: RecordIO."
> ```bash
> paddle cp filenames pfs://home/folder/
> ```
> ```bash
> paddle pfs cp filenames /pfs/$DATACENTER/home/$USER/folder/
> ```
Is it filenames or filepath? If it is filenames, could it be misread as paddle pfs cp file1 file2 file3 ... /pfs/$DATACENTER/home/...?
This should indeed support paddle pfs cp file1 file2 file3 ... /pfs/$DATACENTER/home/.
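Multi-source cp with a trailing destination, as agreed above, can be parsed with one small rule: everything but the last argument is a source. `parse_cp_args` is a hypothetical helper name, not part of the actual tool.

```python
# Sketch: parse `paddle pfs cp <src>... <dst>` so several source files can be
# copied into one destination, cp(1)-style. Illustrative helper only.

def parse_cp_args(args):
    """Split the argument list into (sources, destination)."""
    if len(args) < 2:
        raise ValueError("usage: cp <src>... <dst>")
    *sources, dest = args
    return sources, dest

print(parse_cp_args(["file1", "file2", "/pfs/dc/home/u/"]))
# (['file1', 'file2'], '/pfs/dc/home/u/')
```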
gongweibao left a comment
Updated according to the comments.
typhoonzero left a comment
LGTM except for one tiny comment.
> ```
> # config file
> [datacenter_1]
> username=wuyi
> ```
> Controlling user permissions
> - The /pfs/$DATACENTER/common datasets are read-only, not writable
> - Currently, after mounting locally, there is read-write permission
"Currently, after mounting locally, there is read-write permission" — does this mean read-write permission on /pfs/$DATACENTER/common after it is mounted locally? It feels like no one except administrators should have write permission on the common directory.
> ```bash
> paddle cp filenames pfs://home/folder/
> ```
> ```bash
> paddle pfs cp filenames /pfs/$DATACENTER/home/$USER/folder/
> ```
If filenames means multiple arguments are supported, I would suggest changing it to paddle pfs cp &lt;src&gt; [&lt;src&gt; ...] /pfs/$DATACENTER/home/$USER/folder.
> For example:
> ```
> f = open('/pfs/datacenter_name/home/user_name/test1.dat')
> ```
I don't quite understand this line; shouldn't it be f = open('/home/user_name/test1.dat')?
- For the Pod it doesn't really matter where the storage is mounted. Using this path mainly keeps it consistent with the PFSClient's access path.
- Going forward, we may drop mounting entirely and access data through an API using a path like '/pfs/datacenter_name/home/user_name/test1.dat' directly. wangyi said Google does not mount; they access data directly in this way.
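The two path forms in this thread — the client's full `/pfs/$DATACENTER/home/$USER/...` form and the pod-local mounted form — differ only by a prefix, so translating between them is mechanical. A sketch, with hypothetical helper names:

```python
# Sketch: translate between the paddle client's remote path form and the
# pod-local mounted form discussed above. Helper names are illustrative.

def to_pod_path(pfs_path, datacenter):
    """'/pfs/<dc>/home/u/x' -> '/home/u/x' inside a Pod of that datacenter."""
    prefix = "/pfs/%s" % datacenter
    if not pfs_path.startswith(prefix + "/"):
        raise ValueError("path is not in datacenter %s" % datacenter)
    return pfs_path[len(prefix):]

def to_pfs_path(pod_path, datacenter):
    """Inverse mapping: prepend the /pfs/<dc> prefix."""
    return "/pfs/%s%s" % (datacenter, pod_path)

print(to_pod_path("/pfs/datacenter_name/home/user_name/test1.dat",
                  "datacenter_name"))
# /home/user_name/test1.dat
```

Whether this mapping lives in a mount configuration or in an API client, keeping it in one place avoids the two views drifting apart.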
Fix according to #1947 and #1953 comment