
I have a string that looks like this:

STRING 1 160 Some descriptor information. /Uselessstuff.; STRING 161 274 Some other descriptor information. /Moreuselessstuff.; STRING 275 1070 Last descriptor info. /Lastuselesspart.

Now I would like to extract the two integers and the information that follows them up to the period, then ignore everything until either the semicolon or the end of the string. So I would hope to end up with:

[('1', '160', 'Some descriptor information'), ('161', '274', 'Some other descriptor information'), ('275', '1070', 'Last descriptor info')]

I've tried:

import re
s = "STRING 1 160 Some descriptor information. /Uselessstuff.; STRING 161 274 Some other descriptor information. /Moreuselessstuff.; STRING 275 1070 Last descriptor info. /Lastuselesspart."
re.findall(r'(\d+)\s(\d+)\s(\w+)', s)

However, this only gives the following:

[('1', '160', 'Some'), ('161', '274', 'Some'), ('275', '1070', 'Last')]

How can I get the rest of the information up to the period?

3 Answers

You could use this regex:

(\d+)\s(\d+)\s([^\.]*)

In Python, that looks like this:

>>> import re
>>> s = "STRING 1 160 Some descriptor information. /Uselessstuff.; STRING 161 274 Some other descriptor information. /Moreuselessstuff.; STRING 275 1070 Last descriptor info. /Lastuselesspart."
>>> m = re.findall(r'(\d+)\s(\d+)\s([^\.]*)', s)
>>> m
[('1', '160', 'Some descriptor information'), ('161', '274', 'Some other descriptor information'), ('275', '1070', 'Last descriptor info')]

Explanation:

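  • (\d+) captures one or more digits into the first group.
  • \s matches exactly one whitespace character after those digits.
  • (\d+) captures one or more digits into the second group.
  • \s again matches exactly one whitespace character.
  • ([^\.]*) captures zero or more characters that are not a literal dot, so the third group stops just before the first period. (The backslash is optional here: inside a character class, . is already literal.)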
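
The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of that idea; the names pattern and records are mine, and the backslash before the dot is dropped because it is not needed inside a character class.

```python
import re

s = ("STRING 1 160 Some descriptor information. /Uselessstuff.; "
     "STRING 161 274 Some other descriptor information. /Moreuselessstuff.; "
     "STRING 275 1070 Last descriptor info. /Lastuselesspart.")

# re.VERBOSE lets each piece of the regex carry its own comment.
pattern = re.compile(r"""
    (\d+)     # group 1: first integer
    \s        # exactly one whitespace character
    (\d+)     # group 2: second integer
    \s        # exactly one whitespace character
    ([^.]*)   # group 3: everything up to, not including, the next period
""", re.VERBOSE)

records = pattern.findall(s)
print(records)
```

findall returns one tuple per match, with one string per capture group, which is the shape asked for in the question.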

1 Comment

Thanks for the explanation! It always helps with regex.

Use [^.]+ instead of \w+ for the third group: it matches every character up to, but not including, the next period.
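
As a rough sketch of how that change might be applied to the string from the question, the snippet below swaps \w+ for [^.]+ and also converts the two numbers to int. The helper name parse_records and the int conversion are my own additions, not something the question asked for.

```python
import re

def parse_records(text):
    """Return (first, second, description) tuples (hypothetical helper)."""
    # [^.]+ takes one or more characters that are not a period,
    # so the description ends right before the first '.'.
    pattern = re.compile(r'(\d+)\s(\d+)\s([^.]+)')
    return [(int(a), int(b), desc) for a, b, desc in pattern.findall(text)]

s = ("STRING 1 160 Some descriptor information. /Uselessstuff.; "
     "STRING 161 274 Some other descriptor information. /Moreuselessstuff.; "
     "STRING 275 1070 Last descriptor info. /Lastuselesspart.")

parsed = parse_records(s)
print(parsed)
```

Note that + needs at least one character, so a record whose description is empty would not match at all, whereas * would still match and return an empty string for it.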

You can use a character class to allow only word characters and whitespace.

>>> re.findall(r'(\d+)\s*(\d+)\s*([\w\s]+)', s)
[('1', '160', 'Some descriptor information'), ('161', '274', 'Some other descriptor information'), ('275', '1070', 'Last descriptor info')]
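
One thing to keep in mind with this variant: it behaves the same as the [^.]* version on the sample string, but not on every input. The sketch below uses two made-up inputs (punctuated and single_number are my own test strings, not from the question) to show where the two patterns differ.

```python
import re

strict = re.compile(r'(\d+)\s(\d+)\s([^.]*)')      # stop at the first period
loose = re.compile(r'(\d+)\s*(\d+)\s*([\w\s]+)')   # word chars and whitespace only

# Made-up input: the descriptor contains a hyphen and a comma.
punctuated = "STRING 1 160 Two-part, hyphenated descriptor. /Useless."
strict_desc = strict.findall(punctuated)
loose_desc = loose.findall(punctuated)
print(strict_desc)  # [('1', '160', 'Two-part, hyphenated descriptor')]
print(loose_desc)   # [('1', '160', 'Two')]

# Made-up input: only one integer is present. Because \s* may match nothing,
# the loose pattern can split "160" into "16" and "0".
single_number = "STRING 160 abc."
strict_single = strict.findall(single_number)
loose_single = loose.findall(single_number)
print(strict_single)  # []
print(loose_single)   # [('16', '0', 'abc')]
```

If the descriptors can contain punctuation, or a record might be missing one of the integers, the stricter \s with [^.]* is probably the safer choice.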
