
Commit b6dcad1

KubeRay IPEX demo added for OpenShift (#10)
* dir structure changed
* distributed sample for IPEX added
* namespace creation command added
* CPU env enabled by default
* some updates based on review
1 parent 747068b commit b6dcad1

24 files changed: +472 -0 lines changed
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
ARG BASE_IMAGE
FROM $BASE_IMAGE as ray-ipex-cpu

RUN echo "unset BASH_ENV PROMPT_COMMAND ENV" >> ${CPU_ENV}/bin/activate

ENV BASH_ENV="${CPU_ENV}/bin/activate" \
    ENV="${CPU_ENV}/bin/activate" \
    PROMPT_COMMAND=". ${CPU_ENV}/bin/activate"

# Install multi-node packages
RUN python -m pip install --no-cache-dir ray[data,train,tune,serve] && \
    python -m pip install --no-cache-dir --extra-index-url=https://pytorch-extension.intel.com/release-whl/stable/cpu/us/ oneccl_bind_pt==2.1.0
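The two pip installs in this layer are what make the image capable of multi-node training: ray[data,train,tune,serve] provides the cluster runtime, and oneccl_bind_pt provides the oneCCL collective-communication backend for torch.distributed on CPU. The training script itself is not part of this diff, so the following is only an illustrative sketch of how these packages are typically wired together with Intel Extension for PyTorch inside a distributed training script; the model, data, and rank handling below are placeholders, not the repository's code.

# Illustrative sketch only (not from this commit): typical use of the packages
# installed above -- IPEX for CPU optimizations, oneccl_bind_pt for the "ccl"
# torch.distributed backend.
import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch as ipex
import oneccl_bindings_for_pytorch  # noqa: F401  -- importing registers the "ccl" backend

# Rank and world size normally come from the launcher (e.g. Ray); the defaults
# below let the sketch run as a single process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(
    backend="ccl",
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)

model = torch.nn.Linear(16, 4)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Apply IPEX optimizations for CPU training, then wrap for data parallelism.
model, optimizer = ipex.optimize(model, optimizer=optimizer)
model = torch.nn.parallel.DistributedDataParallel(model)

criterion = torch.nn.MSELoss()
x, y = torch.randn(8, 16), torch.randn(8, 4)        # placeholder batch
optimizer.zero_grad()
criterion(model(x), y).backward()
optimizer.step()
dist.destroy_process_group()

The CCL_WORKER_COUNT environment variable that the notebook below passes in the job's runtime_env controls how many communication worker threads oneCCL spawns per rank.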
Lines changed: 334 additions & 0 deletions
@@ -0,0 +1,334 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1e4160f6-5d79-418a-a46f-4a50ec56db62",
   "metadata": {},
   "source": [
    "# Distributed PyTorch Training on OpenShift using KubeRay Operator and Intel® Extension for PyTorch*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a20d3ca9-2f58-43b0-b7aa-2e80eb588ad0",
   "metadata": {
    "tags": []
   },
   "source": [
    "This notebook demonstrates using the Intel Extension for PyTorch to optimize distributed workloads on Intel hardware with Red Hat OpenShift AI and the KubeRay operator. For this demo we fine-tune a large language model from Hugging Face Transformers on 2 or more nodes. The notebook uses the CodeFlare SDK to create a Ray cluster and launch a distributed training job on it."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbf33887-b3ad-4c0c-b5d7-7a3fd4ffa1fa",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Install the CodeFlare SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60ae1703-9ebe-4fe5-8278-bede84609a8e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "! pip install codeflare-sdk==0.14.1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27db73de-d7ee-4587-91b7-25eaf4df51e5",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Import the necessary CodeFlare SDK modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bded3130-c62c-4773-94bd-b008e2564900",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration\n",
    "from codeflare_sdk.cluster.auth import TokenAuthentication\n",
    "from codeflare_sdk.job.ray_jobs import RayJobClient"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "262b0fc6-ca66-4c82-a439-65d03dc699b5",
   "metadata": {},
   "source": [
    "## Authenticate to the OCP cluster\n",
    "\n",
    "**NOTE: Please fill in the values of the variables auth_token, api_server, and registry below.**\n",
    "\n",
    "To find the token, use the Red Hat OpenShift Container Platform web console."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6992dd4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Variables to be set by the user.\n",
    "\n",
    "auth_token = \"XXXX\"\n",
    "api_server = \"XXXX\"\n",
    "registry = \"XXXX\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc5c7791-a2fc-4b2d-af40-6b65ec8b976e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "auth = TokenAuthentication(\n",
    "    token=auth_token,\n",
    "    server=api_server,\n",
    "    skip_tls=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a50572b-f237-4ca4-a2d0-91623504f846",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "auth.login()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44dbaf70-3622-464f-8f0c-2e8df7a91611",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Launch a Ray cluster using the CodeFlare SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7ba13aa-5608-489d-bbf3-e2d55f340295",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "cluster = Cluster(ClusterConfiguration(\n",
    "    name='ray-ipex-demo',\n",
    "    namespace='ray-ipex',\n",
    "    num_workers=2,\n",
    "    head_memory=20,\n",
    "    head_cpus=32,\n",
    "    min_cpus=32,\n",
    "    max_cpus=32,\n",
    "    min_memory=20,\n",
    "    max_memory=20,\n",
    "    num_gpus=0,\n",
    "    image=\"{0}/ray-ipex/ray-ipex:latest\".format(registry),\n",
    "    instascale=False,\n",
    "    openshift_oauth=True\n",
    "))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d75bc8b-76fd-4b3e-9946-154bfc823dc3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "cluster.up()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b71ba73a-a575-4bf0-99c1-e7bb07539189",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# This call waits for the cluster to be ready before moving on to the next instruction\n",
    "cluster.wait_ready()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1fbff39-468f-4a0c-8a4c-7b79807ae5ee",
   "metadata": {},
   "source": [
    "## List the details of the created Ray cluster and the dashboard access link"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbbf6687-ce55-4739-8035-ec85dc06bee6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "cluster.details()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7811c17a-2eb5-4f20-9bea-cb612dc84c01",
   "metadata": {},
   "source": [
    "## Launch the distributed job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "115a5391-35d8-4eda-a488-02e2e92e27a7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Gather the dashboard URL\n",
    "ray_dashboard = cluster.cluster_dashboard_uri()\n",
    "\n",
    "# Create the header for passing your bearer token\n",
    "header = {\n",
    "    'Authorization': f'Bearer {auth_token}'\n",
    "}\n",
    "\n",
    "# Initialize the RayJobClient\n",
    "client = RayJobClient(address=ray_dashboard, headers=header, verify=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cfd2432-a063-47dd-8b00-b4c676f742c2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Submit the LLM finetuning job using the RayJobClient\n",
    "submission_id = client.submit_job(\n",
    "    entrypoint=\"python LLM.py\",\n",
    "    runtime_env={\"working_dir\": \"./\", \"pip\": \"requirementsLLM.txt\",\n",
    "                 \"env_vars\": {'CCL_WORKER_COUNT': '1'}},\n",
    ")\n",
    "print(\"The job's submission ID is {}; it can be used to stop or delete the job.\".format(submission_id))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af738cf1-27ef-4be1-8a97-a8797b2bf074",
   "metadata": {},
   "source": [
    "## Print the logs from the running job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09a1fb4b-d535-40de-9f1a-acf6fe0951d3",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "async for lines in client.tail_job_logs(submission_id):\n",
    "    print(lines, end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4dce7dd-87ee-484a-b827-3fa67bbce931",
   "metadata": {},
   "source": [
    "#### NOTE: To stop or delete the job, uncomment the code in the next cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "677d8843-88c8-4589-bb37-22a3cf922bd7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# client.stop_job(submission_id)\n",
    "# client.delete_job(submission_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d83984b6-a8ab-4afd-8b7d-7afa40eb70a9",
   "metadata": {},
   "source": [
    "## Stop the cluster once all jobs are finished"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8587a548-dcf0-4b19-9cdb-7c2a1bcbe184",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "cluster.down()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pytorch-cpu",
   "language": "python",
   "name": "pytorch-cpu"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
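The notebook tails the job logs interactively before tearing the cluster down. For unattended runs, an alternative is to poll the job until it reaches a terminal state and only then call cluster.down(). A hedged sketch of that pattern, assuming the client, submission_id, and cluster objects defined in the notebook above, and assuming the CodeFlare RayJobClient exposes the underlying Ray job-submission methods get_job_status and get_job_logs (the 30-second polling interval is arbitrary):

# Hedged sketch: wait for the submitted job to finish, then tear down the cluster.
# Assumes `client`, `submission_id`, and `cluster` from the notebook above.
import time
from ray.job_submission import JobStatus

terminal_states = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
while True:
    status = client.get_job_status(submission_id)
    print(f"Job {submission_id} is {status}")
    if status in terminal_states:
        break
    time.sleep(30)

# Print the tail end of the collected logs, then release the cluster resources.
print(client.get_job_logs(submission_id)[-2000:])
cluster.down()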
