New API about checkpoint and models #10878
Conversation
python/paddle/fluid/io.py (outdated)

```python
checkpoint_dir = os.getcwd()
raise ValueError("The values of 'checkpoint_dir' should not be None")

if trainer_args and not isinstance(trainer_args, dict):
```
What are the details of trainer_args? Do we need a class instead of a parameter with dict type? It's confusing for users.

trainer_args is not for users, just for developers. Currently trainer_args only contains step_id and epoch_id; there may be more arguments that need to be saved in the checkpoint later.

Please use clear parameters; a plain dict would confuse other developers. Use step_id and epoch_id directly, or a class as the configuration parameter.

I wrote two functions to make it clearer.
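The two helpers are not shown in this thread; here is a minimal sketch of what save_trainer_args and load_trainer_args might look like. The per-trainer folder layout, the file name, and JSON serialization are all assumptions for illustration:

```python
import json
import os

TRAINER_ARGS_FILE = "trainer_args.json"  # assumed file name


def save_trainer_args(cur_dir, trainer_id, trainer_args):
    # Each trainer writes its own arguments (e.g. step_id, epoch_id)
    # into a per-trainer directory inside the checkpoint folder.
    assert isinstance(trainer_args, dict)
    trainer_dir = os.path.join(cur_dir, "trainer_%d" % trainer_id)
    os.makedirs(trainer_dir, exist_ok=True)
    with open(os.path.join(trainer_dir, TRAINER_ARGS_FILE), "w") as f:
        json.dump(trainer_args, f)


def load_trainer_args(cur_dir, trainer_id):
    # Returns the previously saved arguments, or None if this trainer
    # has no saved state in the checkpoint.
    path = os.path.join(cur_dir, "trainer_%d" % trainer_id, TRAINER_ARGS_FILE)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

Splitting save and load keeps the dict internal to the checkpoint code, which addresses the reviewer's concern about exposing a bare dict to users.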
```python
save_trainer_args(cur_dir, trainer_id, trainer_args)

if is_chief:
    save_persist_vars_without_grad(executor, cur_dir, main_program)
```
It looks like gradient vars are all non-persistent, so maybe the function name could be shortened to save_persistent_vars? BTW, persist is a verb; we need the adjective: persistent.

First, I found variables named "X@GRAD" that are persistent.
Second, save_persistent_vars does not filter RAW variables.
```python
if "@GRAD" in var.name:
    return False

if ".trainer_" in var.name:
```
Can you add some comments to explain the meaning of the hard-coded .block and .trainer_?
python/paddle/fluid/io.py (outdated)

```python
def load_checkpoint(executor, checkpoint_dir=None, main_program=None):
def need_load_checkpoint(checkpoint_dir):
```
I'm confused about this function: need to do ... implies the function should return a boolean, so that we can use it as:

```python
if need_load_checkpoint(xxx):
    # resume from the checkpoint
else:
    # train from the beginning
```

I renamed need_load_checkpoint to get_latest_checkpoint_serial to make it more meaningful.
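A minimal sketch of what get_latest_checkpoint_serial could do, assuming checkpoint folders are named checkpoint_<serial> as in the naming convention discussed later in this thread; the body is an illustration, not the PR's actual implementation:

```python
import os

CHECKPOINT_PREFIX = "checkpoint"
CHECKPOINT_SEPARATOR = "_"


def get_latest_checkpoint_serial(checkpoint_dir):
    # Returns the largest serial number found under checkpoint_dir,
    # or -1 when there is no valid checkpoint to resume from. The
    # caller can then test `serial >= 0` instead of a boolean helper.
    if not checkpoint_dir or not os.path.isdir(checkpoint_dir):
        return -1
    best = -1
    prefix = CHECKPOINT_PREFIX + CHECKPOINT_SEPARATOR
    for name in os.listdir(checkpoint_dir):
        if not name.startswith(prefix):
            continue
        try:
            best = max(best, int(name[len(prefix):]))
        except ValueError:
            continue  # not a serial-numbered checkpoint folder
    return best
```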
python/paddle/fluid/io.py
Outdated
| """ | ||
| if checkpoint_dir is None: | ||
| checkpoint_dir = os.getcwd() | ||
| raise ValueError("The values of 'checkpoint_dir' should not be None") |
checkpoint_dir should not be None.

checkpoint_dir should be checked in Trainer.py; a proper directory will be given by Trainer.py.
python/paddle/fluid/io.py (outdated)

```python
def load_checkpoint(executor, checkpoint_dir=None, main_program=None):
def get_latest_checkpoint_serial(checkpoint_dir):
```
It seems we don't need get_latest_checkpoint_serial; it does the same thing as _get_latest_checkpoint_dir. Or we can rename _get_latest_checkpoint_dir to get_latest_checkpoint_serial.
python/paddle/fluid/io.py (outdated)

```python
if checkpoint_dir is None:
    checkpoint_dir = os.getcwd()
raise ValueError(
    "The values of 'checkpoint_dir' or 'serial' should not be None")
```
The message says "The values of 'checkpoint_dir' or 'serial' should not be None", but it seems only checkpoint_dir is checked here?
python/paddle/fluid/io.py (outdated)

```python
if serial < 0:
    return
if main_program is None:
    raise ValueError("The values of 'main_program'should not be None")
```
"The values of 'main_program'should not be None" is missing a space; "main_program should not be None." would be better.
python/paddle/fluid/io.py (outdated)

```python
load_persist_vars_without_grad will load variables from a directory by an executor,
the variable named end with "@GRAD" will not be loaded.

:param executor
```
Please add detailed comments about the parameters.
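A hedged example of the parameter documentation being requested, in the reST `:param:` style the snippet already uses. The signature and descriptions are assumptions; the body is omitted:

```python
def load_persist_vars_without_grad(executor, dirname, program):
    """
    Load persistable variables from `dirname` with `executor`,
    skipping any variable whose name ends with "@GRAD".

    :param executor: the fluid Executor that runs the load operators
    :param dirname: directory the checkpointed variables were saved to
    :param program: the Program whose persistable variables are loaded
    """
    # Body omitted; this sketch only illustrates the docstring style.
    pass
```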
python/paddle/fluid/trainer.py (outdated)

```python
# only
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
self.chief = self.trainer_id == 0
```
Why hard-code trainer_id and chief to 0?

We initialize trainer_id = 0 and chief = True by default.
If PaddlePaddle runs locally, there is only one trainer; its trainer_id is 0, and it is obviously the chief.
If PaddlePaddle runs distributed, we read PADDLE_TRAINER_ID from the environment, and only one trainer will be the chief.
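The initialization described above can be condensed into a small sketch. The attribute names follow the trainer.py snippet; the standalone class is just scaffolding for illustration:

```python
import os


class TrainerInit:
    # Locally, PADDLE_TRAINER_ID is unset, so trainer_id defaults to 0
    # and the single trainer is the chief. In a distributed job the
    # launcher exports a distinct PADDLE_TRAINER_ID per trainer, and
    # exactly one trainer (id 0) becomes the chief.
    def __init__(self):
        self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
        self.chief = self.trainer_id == 0
```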
Yancey0623
left a comment
Please add a unit test for the checkpoint feature.
Yancey0623
left a comment
I have some comments above; please follow them, thanks.
python/paddle/fluid/io.py (outdated)

```python
return trainer_dir


def _lru_delete(dirname, max_num_checkpoints=3):
```
Seems this function does not implement a real LRU algorithm; scroll_delete would be a better name.
python/paddle/fluid/io.py (outdated)

```python
int(serial)
except ValueError:
serial = _get_dir_serial(cur_dir)
if serial == -1:
```
Please merge the two conditional statements.
python/paddle/fluid/trainer.py (outdated)

```python
return self._get_parallel_executor()


def _clean_checkpoint(self):
    if not self.checkpoint:
```
Use assert instead of returning directly; otherwise we don't know whether this function succeeded.
python/paddle/fluid/trainer.py (outdated)

```python
return trainer_args


def _save_checkpoint(self, epoch_id, step_id):
    if not self.checkpoint:
```
For the same reason, please use assert.
python/paddle/fluid/io.py (outdated)

```python
:param checkpoint_dir
:param max_num_checkpoints
:param save_interval_secs
:param trainer_id
```
These need more detailed comments.
```python
class TestCheckpoint(unittest.TestCase):
    def setUp(self):
        self.dirname = "/tmp/ckpt"
```
Better to use tempfile instead of a hard-coded path.
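The suggested change, sketched with the standard tempfile module. The class name follows the snippet above; the tearDown cleanup and the example test method are assumptions:

```python
import os
import shutil
import tempfile
import unittest


class TestCheckpoint(unittest.TestCase):
    def setUp(self):
        # A fresh temporary directory per run avoids collisions between
        # parallel test jobs, unlike a hard-coded /tmp/ckpt.
        self.dirname = tempfile.mkdtemp()

    def tearDown(self):
        # Remove the directory so failed runs leave no state behind.
        shutil.rmtree(self.dirname, ignore_errors=True)

    def test_dir_exists(self):
        # Placeholder check standing in for the real checkpoint tests.
        self.assertTrue(os.path.isdir(self.dirname))
```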
Yancey0623
left a comment
LGTM. This feature is related to the user-facing API; please @typhoonzero double check.
python/paddle/fluid/io.py (outdated)

```python
:param executor executor for save the value
:param checkpoint_dir the checkpoint directory
:param trainer_id currect trainer id
:param is_chief if the trainer id equals 0, the is_chief will be true
```
If we have is_chief, why do we still need to pass trainer_id?

Each trainer needs to save its own arguments in practice.
Only the chief needs to save variables.

I have deleted the code about chief.
```python
def _get_serial_dir(serial, checkpoint_dir):
    serial_folder = CHECKPOINT_PREFIX + CHECKPOINT_SEPARATOR + str(serial)
    return os.path.join(checkpoint_dir, serial_folder)


def load_persist_vars_without_grad(executor,
```
I wrote load_persist_vars_without_grad just because of the filter; I need to write a new filter to select which variables to load.
The tests don't seem to pass?

I will check it.
Related: load_model/save_model API, update fluid Train API param_path to checkpoint_config #10828
Related: Incremental Learning Support for Fluid with Distribution #10870