
Commit 4209884

Node demo fix, Readme
1 parent 08a7d8f commit 4209884

3 files changed: 59 additions and 22 deletions


‎4. Naive Bayes Classification.ipynb‎

Lines changed: 6 additions & 2 deletions
@@ -1,7 +1,7 @@
 {
 "metadata": {
 "name": "",
-"signature": "sha256:ad3e77c84b6a81dc5eb1f4aab46bfb3ce903969463e671f758759e3e251a1e38"
+"signature": "sha256:0df1e645a00490baf1a4cd243e17fdb996e55ad258e665d8b406f4982d82660f"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
@@ -21,7 +21,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This is an example of going from labeled text to machine classification, first with NLTK and then the Python machine learning library scikit-learn. Examples updated from my OpenVis Conf talk here, which is more entertaining: https://www.youtube.com/watch?v=f41U936WqPM\n",
+"This is an example of going from labeled text to machine classification, first with NLTK and then the Python machine learning library scikit-learn. Examples updated from my OpenVis Conf talk here, which is more entertaining: https://www.youtube.com/watch?v=f41U936WqPM and slides: http://www.slideshare.net/arnicas/the-bones-of-a-bestseller\n",
 "\n",
 "###Warning: Rated NC-17. Using text samples from \"50 Shades of Gray\"! (Because spam is boring.)###\n",
 "\n"
@@ -719,6 +719,10 @@
 "source": [
 "###Perkins discusses some of the differences in performance in his book. Basically, for machine learning problems, sklearn is highly optimized.\n",
 "\n",
+"Let's look at a visualization of the accuracy of one of the runs, from my [Openvis Conf talk](http://www.slideshare.net/arnicas/the-bones-of-a-bestseller):\n",
+"\n",
+"http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html\n",
+"\n",
 "**Now, optionally, you can look at notebook 5 on using sklearn to do the same things!**"
 ]
 },

‎README.md‎

Lines changed: 23 additions & 1 deletion
@@ -1,2 +1,24 @@
 # NLP-in-Python
-Intro to some NLP concepts in Python for a class
+
+Intro to some NLP concepts and libraries in Python for a class at CMU,
+Feb 2015.
+
+Lots of libraries are required - see [here](Installation.md) for install info.
+
+Notebook viewer links:
+
+0. [Reading in Files](http://nbviewer.ipython.org/github/arnicas/NLP-in-Python/blob/master/0.%20Reading%20Files.ipynb)
+
+1. [Tokenizing, Stemming, POS](http://nbviewer.ipython.org/github/arnicas/NLP-in-Python/blob/master/1.Tokenizing%2C%20Stemming%2C%20POS.ipynb)
+
+2. [Wordclouds](http://nbviewer.ipython.org/github/arnicas/NLP-in-Python/blob/master/2.%20WordClouds.ipynb)
+
+3. [TF-IDF, Clustering, Pattern](http://nbviewer.ipython.org/github/arnicas/NLP-in-Python/blob/master/3.%20TF-IDF%2C%20Clustering%2C%20Pattern.ipynb)
+
+Doing some TF-IDF NLP in node's package "natural" (but caveats apply):[here](utils/booksNodeTfIdf.js)
+
+4. [Naive Bayes Classification](http://nbviewer.ipython.org/github/arnicas/NLP-in-Python/blob/master/4.%20Naive%20Bayes%20Classification.ipynb) - the infamous 50 Shades Sex Scene Detector
+
+5. [Naive Bayes in Scikit-Learn](http://nbviewer.ipython.org/github/arnicas/NLP-in-Python/blob/master/5.%20Naive%20Bayes%20in%20Scikit-Learn.ipynb)
+
+

‎utils/booksNodeTfIdf.js‎

Lines changed: 30 additions & 19 deletions
@@ -1,25 +1,33 @@
 
-/*
+/************
 
-This uses code from the natural package, particularly:
+Lynn Cherny, 2/2015:
+
+This uses code from the node natural package, particularly:
 https://github.com/NaturalNode/natural/blob/master/lib/natural/tfidf/tfidf.js
 which implements tf-idf in its own way and removes stopwords, apparently.
+
 https://github.com/NaturalNode/natural/blob/master/lib/natural/util/stopwords.js
 
-*/
+To run this, you need to
+>npm install natural
+> node booksNodeTfIdf.js <files dir without trailing / ! > <word>
+
+*************/
 
 var natural = require('natural'),
     TfIdf = natural.TfIdf,
     tfidf = new TfIdf();
 
 // Borrowed from Shiffman's node examples for his AtoZ Programming class
-// We can get command line arguments in a node program
+// We can get command line arguments in a node program:
 if (process.argv.length < 4) {
     console.log('Oops, you forgot to pass in a directory path and file to evaluate.');
     process.exit(1);
 }
 
-// The 'fs' (file system) module allows us to read and write files
+// The 'fs' (file system) module allows us to read and write files:
+
 var fs = require('fs');
 // A path for all the files of data
 var path = process.argv[2];
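
The usage note in the file header asks for a directory without a trailing slash, and the argument check only counts the entries in `process.argv`. As a hedged sketch (not part of this commit), Node's built-in `path.join` could make the trailing-slash caveat unnecessary. The helper names `parseArgs` and `filePath` and the sample values `books/`, `love`, and `emma.txt` are hypothetical, chosen only for illustration.

```javascript
var nodePath = require('path');

// Hypothetical helper (not in the commit): validate argv and pull out
// the directory and the word to look up.
function parseArgs(argv) {
    // argv[0] is the node binary and argv[1] is the script, so two user
    // arguments mean at least four entries.
    if (argv.length < 4) {
        return null;
    }
    return { dir: argv[2], word: argv[3] };
}

// Hypothetical helper: path.join normalizes separators, so 'books' and
// 'books/' produce the same file path.
function filePath(dir, name) {
    return nodePath.join(dir, name);
}

// Sample invocation with made-up arguments.
var args = parseArgs(['node', 'booksNodeTfIdf.js', 'books/', 'love']);
console.log(filePath(args.dir, 'emma.txt'), args.word);
```

Returning `null` instead of calling `process.exit` inside the helper keeps it easy to test; the caller can still print the error message and exit.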
@@ -31,30 +39,33 @@ var files = fs.readdirSync(path);
 // Total word count of document
 var totalwords = 0;
 
+function printTopN(docNumber, N) {
+    var sorted = tfidf.listTerms(docNumber).sort(function(a, b) {
+        return b.tfidf - a.tfidf;
+    }).slice(0,N);
+
+    console.log("Top terms for doc " + docNumber + ":")
+    sorted.forEach(function (a) {
+        console.log(" ", a.term, a.tfidf);
+    });
+}
+
+// Add each of the files to the model
+
 files.forEach(function(d,i) {
     console.log('Doc #' + i + ' is ' + path + '/' + d);
     tfidf.addFileSync(path + '/' + d);
 });
 
 setTimeout(function () {
-    // doing it in a timeout because otherwise the files.each above may not have finished
+    // doing it in a timeout because otherwise the files.each above may not
+    // have finished
     files.forEach(function(d,i) {
         printTopN(i, 6);
     });
-    console.log('Looking up tfidf for: ' + word);
+    console.log('Looking up tfidf for: ' + word);
     tfidf.tfidfs(word, function(i, measure) {
-        console.log('document #' + i + ' is ' + measure);
+        console.log('document #' + i + ' with ' + measure);
     });
 }, 500);
 
-function printTopN(docNumber, N) {
-    var sorted = tfidf.listTerms(docNumber).sort(function(a, b) {
-        return b.tfidf - a.tfidf;
-    }).slice(0,N);
-
-    console.log("Top terms for doc " + docNumber + ":")
-    sorted.forEach(function (a) {
-        console.log(" ", a.term, a.tfidf);
-    });
-}
-
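
For readers who want to see what `printTopN` is ranking, here is a minimal sketch of TF-IDF scoring in plain JavaScript with no dependencies. It assumes the textbook weighting tf × ln(N / df) and a naive tokenizer; the `natural` package implements tf-idf in its own way and removes stopwords, so its scores will not match these. The function names and the three sample sentences are hypothetical.

```javascript
// Hypothetical tokenizer: lowercase, split on anything that is not a-z.
function tokenize(text) {
    return text.toLowerCase().split(/[^a-z]+/).filter(Boolean);
}

// Build one term-count map per document.
function termCounts(docs) {
    return docs.map(function (text) {
        // Null prototype so words like "constructor" are counted safely.
        var counts = Object.create(null);
        tokenize(text).forEach(function (t) {
            counts[t] = (counts[t] || 0) + 1;
        });
        return counts;
    });
}

// Score every term of one document and return the n highest, in the
// same shape ({ term, tfidf }) that printTopN sorts and slices.
// Assumption: tfidf = term count * ln(number of docs / docs containing term).
function topN(docs, docNumber, n) {
    var counts = termCounts(docs);
    var scored = Object.keys(counts[docNumber]).map(function (term) {
        var df = counts.filter(function (c) { return term in c; }).length;
        var idf = Math.log(docs.length / df);
        return { term: term, tfidf: counts[docNumber][term] * idf };
    });
    return scored.sort(function (a, b) {
        return b.tfidf - a.tfidf;
    }).slice(0, n);
}

// Made-up sample corpus.
var docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'the cat chased the dog'
];

topN(docs, 0, 2).forEach(function (a) {
    console.log(' ', a.term, a.tfidf);
});
```

Under this weighting a term that appears in every document (here, `the`) scores zero, which is why very common words drop out of a top-N list even without a stopword filter.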