Scraping And Finding Ordered Words In A Dictionary using Python
What are ordered words?
An ordered word is a word in which the letters appear in alphabetic order. For example
abbey
&
dirt
. The rest of the words are unordered for example
geeks
The task at hand
This task is taken from Rosetta Code and it is not as mundane as it sounds from the above description. To get a large number of words we will use an online dictionary available on
http://www.puzzlers.org/pub/wordlists/unixdict.txt
which has a collection of about 2,500 words and since we are gonna be using python we can do that by scraping the dictionary instead of downloading it as a text file and then doing some file handling operations on it.
Requirements:
pip install requests
Code
The approach will be to traverse the whole word and compare the ascii values of elements in pairs until we find a false result otherwise the word will be ordered. So this task will be divided in 2 parts:
Scraping
- Using the python library requests we will fetch the data from the given URL
- Store the content fetched from the URL as a string
- Decoding the content which is usually encoded on the web using UTF-8
- Converting the long string of content into a list of words
Finding the ordered words
- Traversing the list of words
- Pairwise comparison of the ASCII value of every adjacent character in each word
- Storing a false result if a pair is unordered
- Otherwise printing the ordered word
# Python program to find ordered words
import requests
# Scrapes the words from the URL below and stores
# them in a list
def getWords():
# contains about 2500 words
url = "http://www.puzzlers.org/pub/wordlists/unixdict.txt"
fetchData = requests.get(url)
# extracts the content of the webpage
wordList = fetchData.content
# decodes the UTF-8 encoded text and splits the
# string to turn it into a list of words
wordList = wordList.decode("utf-8").split()
return wordList
# function to determine whether a word is ordered or not
def isOrdered():
# fetching the wordList
collection = getWords()
# since the first few of the elements of the
# dictionary are numbers, getting rid of those
# numbers by slicing off the first 17 elements
collection = collection[16:]
word = ''
for word in collection:
result = 'Word is ordered'
i = 0
l = len(word) - 1
if (len(word) < 3): # skips the 1 and 2 lettered strings
continue
# traverses through all characters of the word in pairs
while i < l:
if (ord(word[i]) > ord(word[i+1])):
result = 'Word is not ordered'
break
else:
i += 1
# only printing the ordered words
if (result == 'Word is ordered'):
print(word,': ',result)
# execute isOrdered() function
if __name__ == '__main__':
isOrdered()
Output:
aau: Word is ordered
abbe: Word is ordered
abbey: Word is ordered
abbot: Word is ordered
abbott: Word is ordered
abc: Word is ordered
abe: Word is ordered
abel: Word is ordered
abet: Word is ordered
abo: Word is ordered
abort: Word is ordered
accent: Word is ordered
accept: Word is ordered
...........................
...........................
...........................