Add convert function #2407
Conversation
Some notes on writing a Design Doc
python/paddle/v2/dataset/common.py (Outdated)

```python
    return reader


def convert(output_path, reader, num_shards, name_prefix):
```
python/paddle/v2/dataset/common.py (Outdated)

```python
    """

    def open_needs(idx):
        n = "%s/%s-%05d" % (output_path, name_prefix, idx)
```
Should follow the format in the design doc:

```
random_images-00000-of-00099
random_images-00001-of-00099
...
random_images-00099-of-00099
```
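As a hedged sketch, the naming scheme quoted from the design doc could be produced like this (the helper name `shard_name` is hypothetical; note that in the listing above the `-of-` suffix appears to be `num_shards - 1`, which this sketch assumes):

```python
def shard_name(output_path, name_prefix, idx, num_shards):
    # Produces e.g. "random_images-00001-of-00099" for 100 shards,
    # matching the design-doc listing (suffix = num_shards - 1).
    return "%s/%s-%05d-of-%05d" % (output_path, name_prefix, idx,
                                   num_shards - 1)
```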
python/paddle/v2/dataset/common.py (Outdated)

```python
    :param name_prefix: the name prefix of generated files.
    """

    def open_needs(idx):
```
What does `open_needs` mean in English? Maybe a more descriptive name would be better.
python/paddle/v2/dataset/common.py (Outdated)

```python
    def open_needs(idx):
        n = "%s/%s-%05d" % (output_path, name_prefix, idx)
        w = recordio.writer(n)
        f = open(n, "w")
```
Why do we need `f`? `writer.Close` already closes the file it opens.
python/paddle/v2/dataset/common.py (Outdated)

```python
    w = None
    f = None

    for i, d in enumerate(reader()):
```
Sorry, I should have mentioned this earlier; please consider this issue: #1915
To randomize, maybe we could add a `shuffle_buffer_size` as an optional parameter: read until the buffer is full, shuffle, and then write to RecordIO.
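The suggestion above might look roughly like this (a sketch of the idea only, not the PR's code; `shuffled_batches` is an illustrative name):

```python
import random

def shuffled_batches(reader, shuffle_buffer_size):
    # Fill a buffer from the reader; once it is full, shuffle it and
    # yield it as one batch to be written out to RecordIO. Repeat
    # until the reader is exhausted; the last batch may be short.
    buf = []
    for record in reader():
        buf.append(record)
        if len(buf) >= shuffle_buffer_size:
            random.shuffle(buf)
            yield buf
            buf = []
    if buf:
        random.shuffle(buf)
        yield buf
```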
```python
def convert(output_path, reader, num_shards, name_prefix):
    import recordio
    import cPickle as pickle
```
Please add a cPickle installation with a fixed version in https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile#L55 .
Python 2.7 and Python 3.4 already include the pickle and cPickle modules. I couldn't find an official answer, but there is an answer.
python/paddle/v2/dataset/common.py (Outdated)

```python
        w.write(pickle.dumps(d, pickle.HIGHEST_PROTOCOL))

        if i % num_shards == 0 and i >= num_shards:
```
I don't quite understand the logic here. Suppose there are N shards in total; this logic writes to the next shard once every N records. Shouldn't every record go to the next shard?

How about something like this:

```go
var writers []writer
// fill writers
writers[i%num_shards].Write(record)
// close all writers once everything is done; don't close and create new writers frequently.
```

Done. I had it wrong at first: I mistook num_shards for an interval, writing to the next file once every num_shards records. Oops!

At first I was thinking that if the number of files is large and the number of records is small, empty files would be generated.
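In Python, the round-robin scheme agreed on in this thread might be sketched as follows (illustrative only; the PR itself would use pre-opened `recordio` writers, closed once at the end):

```python
import pickle

def write_sharded(records, writers):
    # Record i goes to shard i % num_shards, so the shards stay
    # balanced and no shard is empty as long as there are at least
    # num_shards records. `writers` is a pre-opened list of shard
    # writers; the caller closes them all after this returns.
    num_shards = len(writers)
    for i, record in enumerate(records):
        writers[i % num_shards].write(
            pickle.dumps(record, pickle.HIGHEST_PROTOCOL))
```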
```python
def test_convert(self):
    def test_reader():
        def reader():
            for x in xrange(10):
```
I think this test will break when 10 is changed to 4. According to line 191: `if i % num_shards == 0 and i >= num_shards:`
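To make that concern concrete, here is which record indices trigger a file rotation under the quoted condition (`num_shards = 5` is an assumed value for illustration):

```python
num_shards = 5

def rotation_points(num_records):
    # Indices where `i % num_shards == 0 and i >= num_shards` holds,
    # i.e. where the code under review would switch to a new file.
    return [i for i in range(num_records)
            if i % num_shards == 0 and i >= num_shards]

# With 10 records only index 5 rotates; with only 4 records the
# condition never fires, so the test's expectations no longer hold.
```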
helinwang left a comment:
Great! Only one minor comment.
python/paddle/v2/dataset/common.py (Outdated)

```python
        reader,
        num_shards,
        name_prefix,
        max_lines_to_shuffle=10000):
```
Maybe 1000 would be better. Say we have images of 200 KB each; 10000 of them is about 1.9 GiB, which might be too much memory consumption.
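A quick back-of-the-envelope check of that estimate:

```python
records = 10000           # max_lines_to_shuffle default
bytes_per_image = 200 * 1024  # 200 KB per image
buffer_gib = records * bytes_per_image / float(1 << 30)
# ~1.9 GiB held in memory while the shuffle buffer is full
```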
Fix convert library