HIT-cwh (Collaborator) commented on Aug 14, 2023:

We support the following three dataset formats:

# Incremental pretraining dataset
{
    'conversation':
        [
            {
                'input': '',
                'output': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.### Human: Now explain it to a dog'
            }
        ]
},
# Single-turn conversation dataset
{
    'conversation':
        [
            {
                'input': 'Give three tips for staying healthy.',
                'output': '1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. 2. Exercise regularly to keep your body active and strong. 3. Get enough sleep and maintain a consistent sleep schedule.'
            }
        ]
},
# Multi-turn conversation dataset
# Note that only the output parts are considered when computing the loss
# (see the masking sketch after these examples).
{
    'conversation':
        [
            {
                'input': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.',
                'output': '### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.'
            },
            {
                'input': '### Human: Now explain it to a dog.',
                'output': 'Monopsony means one big buyer controls everything. It can lead to lower wages and fewer opportunities for dogs. We need fair competition and rules to protect us. More bones, please! Woof!'
            }
        ]
}
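
To make the loss rule concrete, below is a minimal sketch of how one of these 'conversation' records could be tokenized so that only the output tokens contribute to the loss. It assumes a Hugging Face tokenizer and uses -100, the label value that PyTorch's CrossEntropyLoss ignores by default; it illustrates the masking idea and is not XTuner's actual preprocessing code.

# A minimal sketch of output-only loss masking, not XTuner's actual code.
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips this label by default

def encode_conversation(conversation, tokenizer):
    input_ids, labels = [], []
    for turn in conversation:
        in_ids = tokenizer.encode(turn['input'], add_special_tokens=False)
        out_ids = tokenizer.encode(turn['output'], add_special_tokens=False)
        input_ids += in_ids + out_ids
        # Input tokens provide context but are masked out of the loss;
        # only the output tokens are supervised.
        labels += [IGNORE_INDEX] * len(in_ids) + out_ids
    return {'input_ids': input_ids, 'labels': labels}

Under this scheme, the incremental pretraining format, whose 'input' is empty, trains on every token of 'output', and the single-turn format is simply a conversation with one turn.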

Below are the results of training Llama2 for 500 iterations on the multi-turn conversation dataset derived from the oasst1 dataset:

[figure: training results for the 500-iteration Llama2 run]

LZHgrla merged commit 264e744 into InternLM:main on Aug 15, 2023.
llkn-2 pushed a commit to llkn-2/xtuner that referenced this pull request on Jul 31, 2024:
* support multi-turn conversations

* refactor multi-turn conversation

* support oasst1 instruction tuning
HIT-cwh pushed a commit to HIT-cwh/xtuner that referenced this pull request on Dec 17, 2024:
* add prefetch

* update prefetch

* add janus

* add janus

* fix

* fix

* fix llama position id error

* fix ProcessPoolExecutor

* update

* fix llama

* delete cache
pppppM added a commit that referenced this pull request on Dec 27, 2024:
* minimum dependency sft

* fix dispatch

* add timer

* add tgs

* internlm2 tp

* rms support tp

* gradient checkpointing

* lazy load pretrain

* temp

* fix bugs

* add data pipeline example

* fix lints

* remove useless code

* fix hard pack bug

* add comments

* clean code

* add shard strategy

* support cpu offload

* support cpu offload

* trust remote code

* fix soft packer bug

* fix soft packer bug

* fix soft packer bug

* refactor data pipeline

* fixup

* fix pad tokens bug

* check input_ids and labels

* check input_ids and labels in collator

* fix load local datasets bug

* fix load cached datasets

* restore dset order

* save cached infos

* accelerate start up

* avoid all gather cached datasets

* fixup

* fix cache bug

* Support group length (#4)

* replace rmsnorm kernel

* support ftdp ds

* support load_bin

* support group by maxlen

* add fsdp_ftdp_sft and fix fsdp_sft

* support ftdp ds

* add lr min

* fix bugs

* fix bugs

* delete

* support llava

* support packer cache

* refactor dist load

* Add sp tp (#5)

* support sp and tp

* add fsdp_tp_sft and modify fsdp_sft

* move chat_template

* fix load_ds

* delete useless codes

* delete useless codes

* fix jsonl load

* refactor

* fix bug

* fix lr scheduler

* refactor setup parallel

* update data load

* fix bugs

* move fsdp

* adapt new parallel load

* fix setup_parallel (#7)

* fix some bugs

* add remote codes

* add convert script

* support load image from ceph

* support load image from ceph

* fix cache dataset bugs

* support multiple images

* support llava interleave

* fix load timeout

* refactor datasets: optimize the cache mechanism and clean up code

* distinguish dataset components based on algorithms

* support fsdp2+3d parallel

* fix lints

* support contiguous batching

* refactor parallel

* zero wasting ppo

* support ascend npu

* fix openai convert

* fix npu bugs

* fix npu bug

* dispatch npu flash attn

* adapt ascend npu

* fix ppo losses

* steady increase in reward

* faster ppo

* fix top-p generate

* support internlm3

* baseline 2.5

* fix internlm3

* (WIP) support hard pack

* support qwen2

* fix dataset bugs

* baseline

* del ppo.py

* fixup

* support hybrid sp

* fix hybrid sp

* qwen2 + hybrid sp

* fix requirements

* avoid re-initialize dist

* support group pack

* pretrain (#13)

* first commit: support internlm3 moe streaming dataset

* move codes

* Moe pretrain (#14)

* first commit: support internlm3 moe streaming dataset

* move codes

* rmsnorm kernel support low version flash_attn

* add barrier

* support prompt length control (#15)

* support VLM Base (#16)

* add internvl

* fix bug

* remove dup code

* support liger of internvl

* fix bug

* add get_repo_git_info

* fix

* add minicpmv

* add minicpmv dispatch

* accelerate tokenize

* Update InternVL (#17)

* fix dpo error

* fix sp error

* update dataset

* fix

* fix rand sampler (#18)

* llama support transformers >= 4.45 (#19)

* convert fsdp1 to fsdp2 in sft.py

* [Feature] Support Liger Kernel (#20)

* filter data by max length (#21)

* fix causal forward, prefetch, and remote code (#22)

* [Enhancement] Accelerating Data Pipeline (#23)

* sample ratio greater than 1.0 and trunc max len

* accelerating the counting of tokens

* log reduced loss

* fix micro bs greater than 1

* [Enhancement] Ensure data integrity when the sampling ratio is more than 1 (#24)

* repeat dataset

* fixup

* fix typos

* fix typos

* [Fix] Pass in temperature during generation (#25)

* Support Janus and fix some errors (#27) (sub-commits identical to the Dec 17, 2024 commit above)

* remove useless code

---------

Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>
Co-authored-by: Happy <lsb19@tsinghua.org.cn>
Co-authored-by: Haian Huang(深度眸) <1286304229@qq.com>
pppppM added a commit that referenced this pull request on Jan 21, 2025:
* [Feature] XTuner Lite (#974) (squash message identical to the Dec 27, 2024 commit above)

* support mlu (#984)

* cleanup

* add internlm3 remote code

* cleanup

* auto patch

* remove useless code

---------

Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>
Co-authored-by: Happy <lsb19@tsinghua.org.cn>
Co-authored-by: Haian Huang(深度眸) <1286304229@qq.com>
Co-authored-by: Lantian Zhang <50076473+DoorKickers@users.noreply.github.com>
HAOCHENYE pushed a commit that referenced this pull request on Sep 8, 2025:
(squash message identical to the Jan 21, 2025 commit above)