0

I have a string which has characters from multiple languages:

'죄송합니다 how are you doing? My name is Yudhiesh and I am 아니 doing good 저기요'

I am trying to chunk this single string into a list of strings based on the number of words. With a chunk size of 7, i.e. at most 7 words per string, the result should be:

['죄송합니다 how are you doing? My name', 'is Yudhiesh and I am 아니 doing', 'good 저기요']

My current attempt, which is based on how you would chunk a list, does not work:

>>> s = '죄송합니다 how are you doing? My name is Yudhiesh and I am 아니 doing good 저기요'
>>> parts = [str(s[i:i+7]) for i in range(0, len(s), 7)]
>>> parts
['죄송합니다 h', 'ow are ', 'you doi', 'ng? My ', 'name is', ' Yudhie', 'sh and ', 'I am 아니', ' doing ', 'good 저기', '요']
2
  • If overlaps are allowed, these are called ngrams stackoverflow.com/questions/13423919/… Commented Feb 11, 2021 at 15:23
  • @OneCricketeer I will have a look at it. Commented Feb 11, 2021 at 15:27
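
For contrast with the non-overlapping chunks asked for in the question, here is a minimal sketch of overlapping word n-grams. The helper name `word_ngrams` is hypothetical and is not taken from the linked question; it assumes words are separated by whitespace.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words; neighbouring runs share n - 1 words."""
    words = text.split()
    # A window can start at any index that still has n words after it.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(word_ngrams('how are you doing', 2))
# ['how are', 'are you', 'you doing']
```

Unlike chunking, every word except those at the ends appears in more than one result, so this only fits if overlaps are acceptable.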

5 Answers

1

First, create a list of words; then create chunks of that list and join each chunk back into a string.

Here is what you need in a function:

def split_max_num(string, max_words):
    """
    >>> split_max_num('죄송합니다 how are you doing? My name is Yudhiesh and I am 아니 doing good 저기요', 7)
    ['죄송합니다 how are you doing? My name', 'is Yudhiesh and I am 아니 doing', 'good 저기요']
    """
    words = string.split()
    len_words = len(words)

    res = list()
    for index in range(0, len_words, max_words):
        res.append(' '.join(words[index:index+max_words]))
    return res

1 Comment

Thanks, this worked on my other test cases as well.
1

How about the following?

def split_max(words, n): 
    words = words.split()
    words = [words[i:i + n] for i in range(0, len(words), n)]
    return [' '.join(l) for l in words]


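# data is the string from the question
data = '죄송합니다 how are you doing? My name is Yudhiesh and I am 아니 doing good 저기요'
split_max(data, 7)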


0

Are you sure you're not setting s=['죄송합니다 how are you doing? My name', 'is 도와 주세요 and I am doing', 'good 저기요'] before producing parts in this example?

You may want to split your original string with "somestring".split(" "), i.e. split on spaces to get a list of all the words; then you can chop that list with indexing as you've tried to do.
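
One detail worth checking when splitting on `" "`: consecutive spaces produce empty strings, whereas `split()` with no argument collapses any run of whitespace. A small sketch of the difference, with a hypothetical helper `chunk_words` that uses the no-argument form:

```python
text = 'how  are you'  # note the double space

# Splitting on a single space keeps an empty string between the two spaces.
print(text.split(' '))  # ['how', '', 'are', 'you']
# Splitting with no argument treats any run of whitespace as one separator.
print(text.split())     # ['how', 'are', 'you']


def chunk_words(sentence, size):
    """Group the words of sentence into strings of at most size words."""
    words = sentence.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


print(chunk_words(text, 2))  # ['how are', 'you']
```

If the input is known to use single spaces only, both forms give the same word list.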

1 Comment

Have you tried `somestring = "yoursentence"; wordlist = somestring.split(" "); sevens = [" ".join(wordlist[i:i+7]) for i in range(0, len(wordlist), 7)]` and it doesn't work?
0

Use .split() on the string and chunk it:

from typing import List

def chunk_list(lst: List[str], chunk_size: int) -> List[List[str]]: 
    return [
        lst[i:i + chunk_size] 
        for i in range(0, len(lst), chunk_size)
    ]
     

def chunk_string(string: str, chunk_size: int) -> List[str]:
    return chunk_list(string.split(), chunk_size)

1 Comment

Thanks for the answer but it produces [['죄송합니다', 'how', 'are', 'you', 'doing?', 'My', 'name'], ['is', 'Yudhiesh', 'and', 'I', 'am', '아니', 'doing'], ['good', '저기요']]
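
To get the flat list of strings from the question rather than a list of word lists, each chunk can be joined back with spaces. A minimal sketch that reuses `chunk_list` from the answer; `chunk_string_joined` is a hypothetical name for the adjusted function:

```python
from typing import List


def chunk_list(lst: List[str], chunk_size: int) -> List[List[str]]:
    # Same helper as in the answer: slices of at most chunk_size items.
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]


def chunk_string_joined(string: str, chunk_size: int) -> List[str]:
    # Join each chunk of words back into one space-separated string.
    return [' '.join(chunk) for chunk in chunk_list(string.split(), chunk_size)]


s = '죄송합니다 how are you doing? My name is Yudhiesh and I am 아니 doing good 저기요'
print(chunk_string_joined(s, 7))
# ['죄송합니다 how are you doing? My name', 'is Yudhiesh and I am 아니 doing', 'good 저기요']
```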
0

You're converting a list into a string representation of a list.

I think you meant to rejoin the words:

s = '죄송합니다 how are you doing? My name is Yudhiesh and I am 아니 doing good 저기요'
s = s.split()
print([" ".join(s[i:i+7]) for i in range(0, len(s), 7)]) 

4 Comments

Hi thanks for the answer but this produces ['죄 송 합 니 다 h', 'o w a r e ', 'y o u d o i', 'n g m y n', 'a m e i s ', 'Y u d h i e s', 'h a n d I', ' a m 아 니 ', 'd o i n g g', 'o o d 저 기 요']
What is s? The string, or the list of words?
It's the entire string.
Okay, then you'll still have to split it for s[i:i+7] to get words rather than characters
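
A short illustration of the point in these comments, using a shortened example string: slicing a `str` yields characters, so joining the slice puts a space between every character, while slicing the list returned by `split()` yields whole words. The variable names here are illustrative only.

```python
s = 'how are you'

# s is still a string here, so s[0:7] is the 7 characters 'how are',
# and join() inserts a space between each of those characters.
char_chunk = ' '.join(s[0:7])

# After split(), slicing operates on words instead of characters.
words = s.split()
word_chunk = ' '.join(words[0:7])

print(repr(char_chunk))
print(repr(word_chunk))  # 'how are you'
```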
