Getting everything before a period using regex?

Question

I have a string that looks like this:

STRING 1 160 Some descriptor information. /Uselessstuff.; STRING 161 274 Some other descriptor information. /Moreuselessstuff.; STRING 275 1070 Last descriptor info. /Lastuselesspart.

Now I would like to extract the two integers and the information that follows, up to the period, then ignore everything until either the semicolon or the end of the string. So I would hope to end up with:

[('1', '160', 'Some descriptor information'), ('161', '274', 'Some other descriptor information'), ('275', '1070', 'Last descriptor info')]

I've tried:

import re
s = "STRING 1 160 Some descriptor information. /Uselessstuff.; STRING 161 274 Some other descriptor information. /Moreuselessstuff.; STRING 275 1070 Last descriptor info. /Lastuselesspart."
re.findall(r'(\d+)\s(\d+)\s(\w+)', s)

However, this only gives the following:

[('1', '160', 'Some'), ('161', '274', 'Some'), ('275', '1070', 'Last')]

How can I get the rest of the information up to the period?

Avinash Raj · Accepted Answer · 2014-06-30 12:37:17Z

You could use this regex:

(\d+)\s(\d+)\s([^\.]*)


Your Python code would be:

>>> import re
>>> s = "STRING 1 160 Some descriptor information. /Uselessstuff.; STRING 161 274 Some other descriptor information. /Moreuselessstuff.; STRING 275 1070 Last descriptor info. /Lastuselesspart."
>>> m = re.findall(r'(\d+)\s(\d+)\s([^\.]*)', s)
>>> m
[('1', '160', 'Some descriptor information'), ('161', '274', 'Some other descriptor information'), ('275', '1070', 'Last descriptor info')]

Explanation:

(\d+) Captures one or more digits into the first group.
\s Matches a single whitespace character after those digits.
(\d+) Captures one or more digits into the second group.
\s Matches another single whitespace character.
([^\.]*) Captures zero or more characters that are not a literal dot, i.e. everything up to the next period.
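The breakdown above can also be written into the pattern itself. This is a minimal sketch using re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. The name PATTERN is only illustrative, and the dot is left unescaped inside the character class, where it is already literal.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE
# so that each component carries its own comment.
PATTERN = re.compile(r"""
    (\d+)     # group 1: first integer
    \s        # a single whitespace character
    (\d+)     # group 2: second integer
    \s        # a single whitespace character
    ([^.]*)   # group 3: everything up to, but not including, the next period
""", re.VERBOSE)

s = ("STRING 1 160 Some descriptor information. /Uselessstuff.; "
     "STRING 161 274 Some other descriptor information. /Moreuselessstuff.; "
     "STRING 275 1070 Last descriptor info. /Lastuselesspart.")

result = PATTERN.findall(s)
print(result)
```

Compiling the pattern once is mostly a readability choice here. For a single call, re.findall with the one-line pattern behaves the same way.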

Pete · Accepted Answer · 2014-06-30 12:30:25Z

Using [^.]+ instead of \w+ will select all characters up to a period.
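As a quick sketch of that substitution, here is the pattern from the question with only the last group changed. The variable name matches is illustrative.

```python
import re

s = ("STRING 1 160 Some descriptor information. /Uselessstuff.; "
     "STRING 161 274 Some other descriptor information. /Moreuselessstuff.; "
     "STRING 275 1070 Last descriptor info. /Lastuselesspart.")

# The pattern from the question, with \w+ swapped for [^.]+ so the third
# group keeps consuming characters until it reaches a period.
matches = re.findall(r'(\d+)\s(\d+)\s([^.]+)', s)
print(matches)
```

Because + requires at least one character, this variant would skip a record whose descriptor is empty, whereas [^.]* would return an empty string for it.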

hwnd · Accepted Answer · 2014-06-30 12:32:40Z

You can use a character class to allow only word characters and whitespace.

>>> re.findall(r'(\d+)\s*(\d+)\s*([\w\s]+)', s)
[('1', '160', 'Some descriptor information'), ('161', '274', 'Some other descriptor information'), ('275', '1070', 'Last descriptor info')]
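One difference from the "anything but a period" approach is worth noting: [\w\s]+ stops at the first character that is neither a word character nor whitespace, so it stops at any punctuation, not only a period. The sketch below uses a made-up input string t (not from the question) whose descriptor contains a hyphen and a comma.

```python
import re

# Hypothetical input: the descriptor contains a hyphen and a comma.
t = "STRING 5 42 Zinc-binding site, partial. /Note."

# Word characters and whitespace only: the match ends at the hyphen.
word_ws = re.findall(r'(\d+)\s*(\d+)\s*([\w\s]+)', t)
print(word_ws)   # [('5', '42', 'Zinc')]

# Anything except a period: the match runs up to the first period.
not_dot = re.findall(r'(\d+)\s(\d+)\s([^.]*)', t)
print(not_dot)   # [('5', '42', 'Zinc-binding site, partial')]
```

Which behaviour is preferable depends on the data. If descriptors can contain punctuation other than the terminating period, the negated class is probably the safer choice.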
