Converting a String to a List of Words?

Question

I'm trying to convert a string to a list of words using Python. I want to take something like the following:

string = 'This is a string, with words!'

Then convert it to something like this:

list = ['This', 'is', 'a', 'string', 'with', 'words']

Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

gilgamar · Accepted Answer · 2012-12-06 00:22:28Z

118

I think this is the simplest way for anyone else stumbling on this post given the late response:

>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']


1 Comment

Levon Over a year ago

You need to separate and eliminate the punctuation from the words (e.g., "string," and "words!"). As is, this does not meet the OP's requirements.

Daniel Sam · Accepted Answer · 2018-04-11 15:51:25Z

Try this:

import re

mystr = 'This is a string, with words!'
# Raw string so that \w reaches the regex engine unchanged.
wordList = re.sub(r"[^\w]", " ", mystr).split()

How it works:

From the docs :

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.

In our case, the pattern [^\w] matches any single character that is not a word character.

\w matches any word character: letters, digits, and the underscore. With the re.ASCII flag it is equivalent to the character set [a-zA-Z0-9_]; for str patterns in Python 3 it also matches non-ASCII letters and digits.

So re.sub replaces every non-word character with a space, and split() then splits the result on whitespace and returns a list.

For example, 'hello-world' becomes 'hello world' after re.sub, and then ['hello', 'world'] after split().

Let me know if anything is unclear.

Remember to handle apostrophes and hyphens, too, since they're not included in \w.
You may want to handle formatted apostrophes and non-breaking hyphens, too.
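
The point about apostrophes and hyphens can be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper name words_with_joiners and the choice of joiner characters (ASCII apostrophe, right single quotation mark U+2019, hyphen-minus, non-breaking hyphen U+2011) are assumptions.

```python
import re

# Hypothetical pattern: a word is a run of \w characters, optionally joined
# to further \w runs by a single apostrophe or hyphen.
# Assumed joiners: ' (U+0027), U+2019, - (U+002D), U+2011.
WORD_RE = re.compile(r"\w+(?:['\u2019\u2011-]\w+)*")

def words_with_joiners(text):
    """Return words, keeping inner apostrophes and hyphens."""
    return WORD_RE.findall(text)

print(words_with_joiners("I'm a custom-built string, with words!"))
# ["I'm", 'a', 'custom-built', 'string', 'with', 'words']
```

Leading or trailing apostrophes and hyphens are still dropped, because a joiner only counts when word characters follow it.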

Tim McNamara · Accepted Answer · 2011-05-31 00:15:21Z

To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:

>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']
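
The tokenizer output above keeps punctuation marks as separate tokens, whereas the question asks for words only. Here is a minimal sketch of one way to drop them. It uses the first token list shown above as literal input so that it runs without NLTK installed, and the helper name drop_punctuation is hypothetical.

```python
def drop_punctuation(tokens):
    # Keep a token only if it contains at least one letter or digit.
    return [t for t in tokens if any(c.isalnum() for c in t)]

tokens = ['Hi', ',', 'this', 'is', 'my', 'first', 'sentence', '.']
print(drop_punctuation(tokens))
# ['Hi', 'this', 'is', 'my', 'first', 'sentence']
```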

JBernardo · Accepted Answer · 2011-05-31 02:19:14Z

22

The simplest way:

>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
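
Note that \w also matches digits and the underscore, and for str patterns in Python 3 it matches non-ASCII letters as well. If only letters should count as word characters, a character class such as [^\W\d_] is one option. A small sketch for comparison; the sample text is made up for illustration:

```python
import re

sample = 'café_2 naïve x1'

all_word_chars = re.findall(r'\w+', sample)
letters_only = re.findall(r'[^\W\d_]+', sample)

print(all_word_chars)  # ['café_2', 'naïve', 'x1']
print(letters_only)    # ['café', 'naïve', 'x']
```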


mtrw · Accepted Answer · 2011-05-31 00:29:48Z

14

Using string.punctuation for completeness:

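import re
import string

s = 'This is a string, with words!'
# re.escape keeps characters such as ], ^, - and \ literal inside the class.
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()

This handles newlines as well, because split() with no arguments splits on any whitespace. Note that punctuation is deleted rather than replaced by a space, so 'hello-world' becomes 'helloworld'.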

answered May 31, 2011 at 0:24 · edited May 31, 2011 at 0:29

martineau · Accepted Answer · 2011-05-31 00:26:25Z

9

Well, you could use

import re
text = 'This is a string, with words!'
words = re.sub(r'[.!,;?]', ' ', text).split()

Note that list is the name of a built-in type and string is the name of a standard-library module, so you probably don't want to use those as your variable names.

answered May 31, 2011 at 0:10 by Cameron · edited May 31, 2011 at 0:26 by martineau

Paulo Freitas · Accepted Answer · 2017-06-08 09:55:37Z

Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:

import re
import string

def extract_words(s):
    # Strip punctuation only at the start or end of each whitespace-separated token.
    pattern = '^[{0}]+|[{0}]+$'.format(re.escape(string.punctuation))
    return [re.sub(pattern, '', w) for w in s.split()]

>>> text = 'This is a string, with words!'
>>> extract_words(text)
['This', 'is', 'a', 'string', 'with', 'words']

>>> text = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(text)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
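
The same boundary-only stripping can probably be done without a regular expression, since str.strip accepts a set of characters to remove from both ends. The following is a sketch under that assumption, not the answer's own method; the name extract_words_strip is hypothetical, and unlike the function above it also drops tokens that consist of punctuation alone.

```python
import string

def extract_words_strip(s):
    # str.strip removes the given characters from both ends only,
    # so inner apostrophes and hyphens survive.
    stripped = (w.strip(string.punctuation) for w in s.split())
    # Drop tokens that consisted of punctuation alone (e.g. a lone "-").
    return [w for w in stripped if w]

print(extract_words_strip('This is a string, with words!'))
# ['This', 'is', 'a', 'string', 'with', 'words']
print(extract_words_strip("I'm a custom-built sentence - with \"tricky\" words."))
# ["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words']
```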

Akhil Cherian Verghese · Accepted Answer · 2018-05-18 05:47:09Z

4

Personally, I think this is slightly cleaner than the other answers provided:

import re

def split_to_words(sentence):
    # re.split can yield empty strings at the edges; filter(None, ...) drops them.
    return list(filter(None, re.split(r'\W+', sentence)))  # use sentence.lower() if needed


tofutim · Accepted Answer · 2011-05-31 00:14:40Z

3

A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".


josliber · Accepted Answer · 2015-08-11 15:24:10Z

1

# Same as mystr.split(" "): maxsplit equals the number of spaces. Punctuation is not removed.
words = mystr.split(" ", mystr.count(" "))

answered Aug 11, 2015 at 15:14 by sanchit · edited Aug 11, 2015 at 15:24 by josliber

BenyaR · Accepted Answer · 2017-08-12 18:32:07Z

This way you eliminate every character that is not a letter of the English alphabet:

def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']

I'm not sure if this is fast or optimal or even the right way to program.
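
For comparison, the same letter-by-letter filtering can be written more compactly. This is a sketch assuming the same rule as above (keep ASCII letters only, drop words that end up empty); the name words_to_list is hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)  # a-z and A-Z, as in the loop version

def words_to_list(text):
    # Keep only ASCII letters inside each whitespace-separated token.
    cleaned = (''.join(c for c in token if c in LETTERS) for token in text.split())
    # Skip tokens that contained no letters at all.
    return [word for word in cleaned if word]

print(words_to_list('She loves you, yea yea yea! '))
# ['She', 'loves', 'you', 'yea', 'yea', 'yea']
```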

Kavindu Nilshan · Accepted Answer · 2022-02-04 12:43:01Z

1

def split_string(string):
    return string.split()

This function will return the list of words of a given string. In this case, if we call the function as follows,

string = 'This is a string, with words!'
split_string(string)

The return output of the function would be

['This', 'is', 'a', 'string,', 'with', 'words!']


Sunil Kumar Suman · Accepted Answer · 2024-12-06 16:55:40Z

You can get a list of the words in a string by using the .split() method (https://www.w3schools.com/python/ref_string_split.asp) and then, if needed, build another collection from it with a comprehension.

sentence = "What is the Airspeed Velocity of an Unladen Swallow?"
split_word = sentence.split()
words = {word for word in split_word}  # set comprehension: removes duplicates, does not preserve order
print(words)

Output (the order of a set may vary; punctuation stays attached, as in 'Swallow?'):

{'What', 'the', 'of', 'an', 'Unladen', 'Airspeed', 'Swallow?', 'Velocity', 'is'}
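
Because a set discards both order and duplicates, a list is usually the closer match to what the question asks for. If unique words are wanted while keeping first-seen order, dict.fromkeys is one option. A minimal sketch; the sentence is extended with a made-up second question so that it contains duplicates, and re.findall is used here to strip the punctuation:

```python
import re

sentence = "What is the Airspeed Velocity of an Unladen Swallow? What is it?"

words = re.findall(r"\w+", sentence)       # all words, in order, duplicates kept
unique_words = list(dict.fromkeys(words))  # duplicates removed, first-seen order kept

print(words)
# ['What', 'is', 'the', 'Airspeed', 'Velocity', 'of', 'an', 'Unladen', 'Swallow', 'What', 'is', 'it']
print(unique_words)
# ['What', 'is', 'the', 'Airspeed', 'Velocity', 'of', 'an', 'Unladen', 'Swallow', 'it']
```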

guest201505281433 · Accepted Answer · 2015-05-28 06:30:26Z

0

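This is from my attempt at a coding challenge that didn't allow regex:

# split() with no argument avoids the empty strings that split(' ') yields for runs of spaces.
outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split()

Apostrophes are deliberately kept, so contractions such as "I'm" stay in one piece.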


Tomek K · Accepted Answer · 2021-03-15 20:03:30Z

Probably not very elegant, but at least you know what's going on.

my_str = "Simple sample, test! is, only".lower()
my_lst = []
temp = ""
for ch in my_str:
    if ch in ",.!():;-":
        continue  # skip punctuation
    if ch == " ":
        # keep only words longer than 3 characters
        if len(temp) > 3:
            my_lst.append(temp)
        temp = ""  # always reset, so a short word is not glued onto the next one
    else:
        temp += ch
if len(temp) > 3:  # flush the last word, applying the same length rule
    my_lst.append(temp)
print(my_lst)  # ['simple', 'sample', 'test', 'only']

What's the point of this solution if there exists a more optimal solution?

Olga Velichko · Accepted Answer · 2024-02-16 17:56:41Z

0

string = 'This is a string, with words!'

list = [word for word in string.split()]

print(list)

['This', 'is', 'a', 'string,', 'with', 'words!']

answered Feb 16, 2024 at 17:53 · edited Feb 16, 2024 at 17:56

2 Comments

ramu Over a year ago

Hi Olga, this question already has an accepted answer that meets the asker's requirements and is simple enough. You may want to consider answering unanswered questions, or posting only if you have a better answer.

OCa Over a year ago

Besides, your answer suffers from the same problem as this one, which is the most upvoted yet inaccurate (see its comment below).

Paulo Freitas · Accepted Answer · 2017-06-08 09:06:14Z

-1

You can try and do this:

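# Python 3: str.maketrans takes the place of Python 2's string.maketrans.
tryTrans = str.maketrans(",!", "  ")  # map ',' and '!' to spaces
text = "This is a string, with words!"
text = text.translate(tryTrans)
listOfWords = text.split()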
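
The translation-table idea extends beyond commas and exclamation marks. Here is a sketch assuming Python 3 and that every character in string.punctuation should act as a separator; the names PUNCT_TO_SPACE and split_words are hypothetical.

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def split_words(text):
    # Replace punctuation with spaces, then split on any whitespace.
    return text.translate(PUNCT_TO_SPACE).split()

print(split_words('This is a string, with words!'))
# ['This', 'is', 'a', 'string', 'with', 'words']
print(split_words("hello-world, it's"))
# ['hello', 'world', 'it', 's']
```

As the second call shows, this also splits contractions and hyphenated words, which may or may not be wanted.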

answered Aug 12, 2013 at 13:49 by user2675185 · edited Jun 8, 2017 at 9:06 by Paulo Freitas
