Simple pipe reader for hdfs or other service #5282
typhoonzero merged 3 commits into PaddlePaddle:develop
def pipe_reader(left_cmd,
                parser,
                bufsize=8192,
                file_type="plain",
Maybe we only need to support "plain"; the user can decompress the data outside of Paddle using a pipe.
I thought it might be inconvenient for users to decompress stream data in their parsers.
I meant the user can decompress the data using shell commands, not in the parsers, e.g.:
hadoop fs -cat /path/to/some/file | gzip -d
Well, this is simpler, but my concern is that the pipe size in bash is set by ulimit. When training on a cluster, users may not have control over every node's ulimit configuration, but they can control it from Python code.
I don't understand bash very well, but doesn't the pipe just "block" when it's full? gzip can probably decode in a streaming fashion and consume the pipe buffer, so the writer will be unblocked.
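To illustrate the streaming point, here is a minimal sketch of incremental gzip decoding in Python. It assumes the compressed bytes arrive in arbitrary chunks, as they would from a pipe; `stream_gunzip` is a hypothetical helper name and is not part of this PR.

```python
import gzip
import zlib


def stream_gunzip(chunks):
    """Yield decompressed bytes as compressed chunks arrive."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decoder.decompress(chunk)
        if data:
            yield data
    tail = decoder.flush()
    if tail:
        yield tail


# Feed the decoder 7 bytes at a time to mimic small reads from a pipe.
payload = gzip.compress(b"line one\nline two\n")
pieces = [payload[i:i + 7] for i in range(0, len(payload), 7)]
result = b"".join(stream_gunzip(pieces))
```

The decoder never needs the whole file in memory, so a consumer like this keeps draining the pipe while the producer keeps writing.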
By default, pipes can block both producer and consumer:
If a process attempts to read from an empty pipe, then read(2) will
block until data is available. If a process attempts to write to a
full pipe (see below), then write(2) blocks until sufficient data has
been read from the pipe to allow the write to complete.
Well, my point is that using pipes in Python code lets users define the pipe buffer size, which is critical to reader performance.
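A minimal sketch of the Python-side approach, assuming a POSIX shell is available. `read_chunks` is a hypothetical name, and the `echo` command stands in for something like `hadoop fs -cat`. Note that the `bufsize` passed to `subprocess.Popen` sets the buffering of the Python file object wrapping the pipe (and, in this sketch, also the read size); it does not change the kernel's pipe capacity.

```python
import subprocess


def read_chunks(cmd, bufsize=8192):
    """Run a shell command and yield its stdout in chunks of up to bufsize bytes."""
    proc = subprocess.Popen(
        cmd, shell=True, stdout=subprocess.PIPE, bufsize=bufsize)
    try:
        while True:
            chunk = proc.stdout.read(bufsize)
            if not chunk:
                break  # EOF: the command closed its stdout
            yield chunk
    finally:
        proc.stdout.close()
        proc.wait()


# A tiny buffer forces the output to arrive in several chunks.
chunks = list(read_chunks("echo hello", bufsize=4))
```

Because the command string is passed to the shell, a pipeline such as `hadoop fs -cat /path/to/some/file | gzip -d` could be used in place of `echo hello`.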
    return xreader


def _buf2lines(buf, line_break="\n"):
Line breaks won't work for binary data. Maybe we should let the parser decide when to output a new data item?
If cut_lines=False, the binary data will be sent to the parser directly. Do you mean that we should let the user's parser generate the data, and make pipe_reader a decorator?
Yes, I thought maybe pipe_reader should not cut the lines, since it does not have sufficient information. We might want to leave it to the user's parser to do so (cut and generate data).
Agree, will update.
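The direction agreed on here could look roughly like the following: the reader only hands raw buffers to a user-supplied parser, and the parser decides where items begin and end. This is a sketch under assumptions, not the merged code; `chunk_reader`, `LineParser`, and the in-memory `fake_source` are hypothetical stand-ins.

```python
def chunk_reader(chunk_source, parser):
    """Wrap a chunk source into a reader; the parser produces the data items."""
    def xreader():
        for chunk in chunk_source():
            for item in parser(chunk):
                yield item
    return xreader


class LineParser(object):
    """Example parser that cuts text lines, keeping partial lines across chunks."""

    def __init__(self):
        self.remainder = b""

    def __call__(self, chunk):
        lines = (self.remainder + chunk).split(b"\n")
        # The last element is an incomplete line (or empty); hold it back.
        self.remainder = lines.pop()
        return lines


def fake_source():
    # Stand-in for data read from a pipe; line breaks do not align with chunks.
    return iter([b"1,2\n3,", b"4\n5,6\n"])


items = list(chunk_reader(fake_source, LineParser())())
```

A parser for binary data would follow the same calling convention but frame items by length prefix or record size instead of `\n`. In this sketch, any trailing bytes without a final line break stay in `remainder` and are not emitted.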
fix pipe_reader unimport packages
helinwang left a comment
LGTM! Let's update https://github.com/PaddlePaddle/Paddle/pull/5282/files#r150348324 with a follow-up commit.
Fix #5011