Skip to main content
HomeAbout PythonLearn Python

Detecting Fake News with Scikit-Learn

This scikit-learn tutorial will walk you through building a fake news classifier with the help of Bayesian models.
Aug 2017  · 15 min read
datacamp graphic

Detecting so-called “fake news” is no easy task. First, there is defining what fake news is – given it has now become a political statement. If you can find or agree upon a definition, then you must collect and properly label real and fake news (hopefully on similar topics to best show clear distinctions). Once collected, you must then find useful features to determine fake from real news.

For a more in-depth look at this problem space, I recommend taking a look at Miguel Martinez-Alvarez’s post “How can Machine Learning and AI Help Solve the Fake News Problem”.

Around the same time I read Miguel’s insightful post, I came across an open data science post about building a successful fake news detector with Bayesian models. The author even created a repository with the dataset of tagged fake and real news examples. I was curious if I could easily reproduce the results, and if I could then determine what the model had learned.

In this tutorial, you’ll walk through some of my initial exploration together and see if you can build a successful fake news detector!

Tip: if you want to learn more about Natural Language Processing (NLP) basics, consider taking the Natural Language Processing Fundamentals in Python course.

Data Exploration

To begin, you should always take a quick look at the data and get a feel for its contents. To do so, use a Pandas DataFrame and check the shape, head and apply any necessary transformations.

eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiaW1wb3J0IHBhbmRhcyBhcyBwZCIsInNhbXBsZSI6IiMgSW1wb3J0IGBmYWtlX29yX3JlYWxfbmV3cy5jc3ZgIFxuZGYgPSBfXy5fX19fX19fXyhcImh0dHBzOi8vczMuYW1hem9uYXdzLmNvbS9hc3NldHMuZGF0YWNhbXAuY29tL2Jsb2dfYXNzZXRzL2Zha2Vfb3JfcmVhbF9uZXdzLmNzdlwiKVxuICAgIFxuIyBJbnNwZWN0IHNoYXBlIG9mIGBkZmAgXG5kZi5fX19fX1xuXG4jIFByaW50IGZpcnN0IGxpbmVzIG9mIGBkZmBcbmRmLl9fX19fXyIsInNvbHV0aW9uIjoiIyBJbXBvcnQgYGZha2Vfb3JfcmVhbF9uZXdzLmNzdmAgXG5kZiA9IHBkLnJlYWRfY3N2KFwiaHR0cHM6Ly9zMy5hbWF6b25hd3MuY29tL2Fzc2V0cy5kYXRhY2FtcC5jb20vYmxvZ19hc3NldHMvZmFrZV9vcl9yZWFsX25ld3MuY3N2XCIpXG4gICAgXG4jIEluc3BlY3Qgc2hhcGUgb2YgYGRmYCBcbmRmLnNoYXBlXG5cbiMgUHJpbnQgZmlyc3QgbGluZXMgb2YgYGRmYCBcbmRmLmhlYWQoKSIsInNjdCI6InRlc3Rfb2JqZWN0KFwiZGZcIiwgdW5kZWZpbmVkX21zZz1cIkRpZCB5b3UgaW1wb3J0IHRoZSBkYXRhIGNvcnJlY3RseT9cIixpbmNvcnJlY3RfbXNnPVwiTWFrZSBzdXJlIHRvIHVzZSB0aGUgUGFuZGFzIGByZWFkX2NzdigpYCBmdW5jdGlvbiB0byByZWFkIGluIHRoZSBkYXRhIVwiKVxudGVzdF9vYmplY3RfYWNjZXNzZWQoXCJkZi5zaGFwZVwiLCBub3RfYWNjZXNzZWRfbXNnPVwiRGlkIHlvdSB1c2UgYHNoYXBlYCB0byByZXRyaWV2ZSB0aGUgc2hhcGUgb2YgdGhlIGBkZmA/XCIpXG50ZXN0X2Z1bmN0aW9uKFwiZGYuaGVhZFwiKVxuc3VjY2Vzc19tc2coXCJOaWNlIG9uZSFcIikifQ==
 
eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiaW1wb3J0IHBhbmRhcyBhcyBwZFxuZGYgPSBwZC5yZWFkX2NzdihcImh0dHBzOi8vczMuYW1hem9uYXdzLmNvbS9hc3NldHMuZGF0YWNhbXAuY29tL2Jsb2dfYXNzZXRzL2Zha2Vfb3JfcmVhbF9uZXdzLmNzdlwiKSIsInNhbXBsZSI6IiMgU2V0IGluZGV4XG5kZiA9IGRmLnNldF9pbmRleChcIlVubmFtZWQ6IDBcIikgXG5cbiMgUHJpbnQgZmlyc3QgbGluZXMgb2YgYGRmYCBcbmRmLl9fX19fXyIsInNvbHV0aW9uIjoiIyBTZXQgaW5kZXggXG5kZiA9IGRmLnNldF9pbmRleChcIlVubmFtZWQ6IDBcIilcblxuIyBQcmludCBmaXJzdCBsaW5lcyBvZiBgZGZgIFxuZGYuaGVhZCgpIiwic2N0IjoidGVzdF9vYmplY3QoXCJkZlwiKVxudGVzdF9mdW5jdGlvbihcImRmLmhlYWRcIilcbnN1Y2Nlc3NfbXNnKFwiQXdlc29tZSBqb2IhXCIpIn0=

Extracting the Training Data

Now that the DataFrame looks closer to what you need, you want to separate the labels and set up training and test datasets.

For this notebook, I decided to focus on using the longer article text. Because I knew I would be using bag-of-words and Term Frequency–Inverse Document Frequency (TF-IDF) to extract features, this seemed like a good choice. Using longer text will hopefully allow for distinct words and features for my real and fake news data.

eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdFxuaW1wb3J0IHBhbmRhcyBhcyBwZFxuZGYgPSBwZC5yZWFkX2NzdihcImh0dHBzOi8vczMuYW1hem9uYXdzLmNvbS9hc3NldHMuZGF0YWNhbXAuY29tL2Jsb2dfYXNzZXRzL2Zha2Vfb3JfcmVhbF9uZXdzLmNzdlwiKVxuZGYgPSBkZi5zZXRfaW5kZXgoXCJVbm5hbWVkOiAwXCIpICIsInNhbXBsZSI6IiNTZXQgYHlgIFxueSA9IGRmLl9fX19fIFxuIFxuIyBEcm9wIHRoZSBgbGFiZWxgIGNvbHVtbiBcbmRmLl9fX18oXCJsYWJlbFwiLCBheGlzPTEpIFxuIFxuIyBNYWtlIHRyYWluaW5nIGFuZCB0ZXN0IHNldHMgXG5YX3RyYWluLCBYX3Rlc3QsIHlfdHJhaW4sIHlfdGVzdCA9IF9fX19fX19fX19fX19fX18oZGZbJ3RleHQnXSwgeSwgdGVzdF9zaXplPTAuMzMsIHJhbmRvbV9zdGF0ZT01MykiLCJzb2x1dGlvbiI6IiMgU2V0IGB5YCBcbnkgPSBkZi5sYWJlbCBcblxuIyBEcm9wIHRoZSBgbGFiZWxgIGNvbHVtblxuZGYuZHJvcChcImxhYmVsXCIsIGF4aXM9MSlcblxuIyBNYWtlIHRyYWluaW5nIGFuZCB0ZXN0IHNldHMgXG5YX3RyYWluLCBYX3Rlc3QsIHlfdHJhaW4sIHlfdGVzdCA9IHRyYWluX3Rlc3Rfc3BsaXQoZGZbJ3RleHQnXSwgeSwgdGVzdF9zaXplPTAuMzMsIHJhbmRvbV9zdGF0ZT01MykiLCJzY3QiOiJ0ZXN0X29iamVjdChcInlcIilcbnRlc3Rfb2JqZWN0KFwiZGZcIilcbnRlc3Rfb2JqZWN0KFwiWF90ZXN0XCIsIGRvX2V2YWw9RmFsc2UsIHVuZGVmaW5lZF9tc2c9XCJEaWQgeW91IGRlZmluZSBgWF90ZXN0YD9cIilcbnRlc3Rfb2JqZWN0KFwiWF90cmFpblwiLCBkb19ldmFsPUZhbHNlLCB1bmRlZmluZWRfbXNnPVwiRGlkIHlvdSBkZWZpbmUgYFhfdHJhaW5gP1wiKVxudGVzdF9vYmplY3QoXCJ5X3RyYWluXCIsIGRvX2V2YWw9RmFsc2UsIHVuZGVmaW5lZF9tc2c9XCJEaWQgeW91IGRlZmluZSBgeV90cmFpbmA/XCIpXG50ZXN0X29iamVjdChcInlfdGVzdFwiLCBkb19ldmFsPUZhbHNlLCB1bmRlZmluZWRfbXNnPVwiRGlkIHlvdSBkZWZpbmUgYHlfdGVzdGA/XCIpXG5zdWNjZXNzX21zZyhcIkF3ZXNvbWUgam9iIVwiKSJ9

Building Vectorizer Classifiers

Now that you have your training and testing data, you can build your classifiers. To get a good idea if the words and tokens in the articles had a significant impact on whether the news was fake or real, you begin by using CountVectorizer and TfidfVectorizer.

You’ll see the example has a max threshhold set at .7 for the TF-IDF vectorizer tfidf_vectorizer using the max_df argument. This removes words which appear in more than 70% of the articles. Also, the built-in stop_words parameter will remove English stop words from the data before making vectors.

There are many more parameters available and you can read all about them in the scikit-learn documentation for TfidfVectorizer and CountVectorizer.

eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiaW1wb3J0IHBhbmRhcyBhcyBwZFxuZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdCBcbmZyb20gc2tsZWFybi5mZWF0dXJlX2V4dHJhY3Rpb24udGV4dCBpbXBvcnQgQ291bnRWZWN0b3JpemVyIFxuZGYgPSBwZC5yZWFkX2NzdihcImh0dHBzOi8vczMuYW1hem9uYXdzLmNvbS9hc3NldHMuZGF0YWNhbXAuY29tL2Jsb2dfYXNzZXRzL2Zha2Vfb3JfcmVhbF9uZXdzLmNzdlwiKVxuZGYgPSBkZi5zZXRfaW5kZXgoJ1VubmFtZWQ6IDAnKVxueSA9IGRmLmxhYmVsXG5kZiA9IGRmLmRyb3AoJ2xhYmVsJywgYXhpcz0xKVxuWF90cmFpbiwgWF90ZXN0LCB5X3RyYWluLCB5X3Rlc3QgPSB0cmFpbl90ZXN0X3NwbGl0KGRmWyd0ZXh0J10sIHksIHRlc3Rfc2l6ZT0wLjMzLCByYW5kb21fc3RhdGU9NTMpIiwic2FtcGxlIjoiIyBJbml0aWFsaXplIHRoZSBgY291bnRfdmVjdG9yaXplcmAgXG5jb3VudF92ZWN0b3JpemVyID0gX19fX19fX19fX19fX19fKHN0b3Bfd29yZHM9J2VuZ2xpc2gnKSBcblxuIyBGaXQgYW5kIHRyYW5zZm9ybSB0aGUgdHJhaW5pbmcgZGF0YSBcbmNvdW50X3RyYWluID0gY291bnRfdmVjdG9yaXplci5fX19fX19fX19fX19fKFhfdHJhaW4pIFxuXG4jIFRyYW5zZm9ybSB0aGUgdGVzdCBzZXRcbmNvdW50X3Rlc3QgPSBjb3VudF92ZWN0b3JpemVyLl9fX19fX19fXyhYX3Rlc3QpIiwic29sdXRpb24iOiIjIEluaXRpYWxpemUgdGhlIGBjb3VudF92ZWN0b3JpemVyYCBcbmNvdW50X3ZlY3Rvcml6ZXIgPSBDb3VudFZlY3Rvcml6ZXIoc3RvcF93b3Jkcz0nZW5nbGlzaCcpXG5cbiMgRml0IGFuZCB0cmFuc2Zvcm0gdGhlIHRyYWluaW5nIGRhdGEgXG5jb3VudF90cmFpbiA9IGNvdW50X3ZlY3Rvcml6ZXIuZml0X3RyYW5zZm9ybShYX3RyYWluKSBcblxuIyBUcmFuc2Zvcm0gdGhlIHRlc3Qgc2V0IFxuY291bnRfdGVzdCA9IGNvdW50X3ZlY3Rvcml6ZXIudHJhbnNmb3JtKFhfdGVzdCkiLCJzY3QiOiJ0ZXN0X29iamVjdChcImNvdW50X3ZlY3Rvcml6ZXJcIiwgZG9fZXZhbD1GYWxzZSlcbnRlc3Rfb2JqZWN0KFwiY291bnRfdHJhaW5cIiwgZG9fZXZhbD1GYWxzZSlcbnRlc3Rfb2JqZWN0KFwiY291bnRfdGVzdFwiLCBkb19ldmFsPUZhbHNlLCB1bmRlZmluZWRfbXNnPVwiRGlkIHlvdSBkZWZpbmUgYGNvdW50X3Rlc3RgP1wiKVxuc3VjY2Vzc19tc2coXCJHcmVhdCBqb2IhXCIpIn0=
 
eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiaW1wb3J0IHBhbmRhcyBhcyBwZFxuZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdCBcbmZyb20gc2tsZWFybi5mZWF0dXJlX2V4dHJhY3Rpb24udGV4dCBpbXBvcnQgVGZpZGZWZWN0b3JpemVyIFxuZGYgPSBwZC5yZWFkX2NzdihcImh0dHBzOi8vczMuYW1hem9uYXdzLmNvbS9hc3NldHMuZGF0YWNhbXAuY29tL2Jsb2dfYXNzZXRzL2Zha2Vfb3JfcmVhbF9uZXdzLmNzdlwiKVxuZGYgPSBkZi5zZXRfaW5kZXgoJ1VubmFtZWQ6IDAnKVxueSA9IGRmLmxhYmVsXG5kZiA9IGRmLmRyb3AoJ2xhYmVsJywgYXhpcz0xKVxuWF90cmFpbiwgWF90ZXN0LCB5X3RyYWluLCB5X3Rlc3QgPSB0cmFpbl90ZXN0X3NwbGl0KGRmWyd0ZXh0J10sIHksIHRlc3Rfc2l6ZT0wLjMzLCByYW5kb21fc3RhdGU9NTMpIiwic2FtcGxlIjoiIyBJbml0aWFsaXplIHRoZSBgdGZpZGZfdmVjdG9yaXplcmBcbnRmaWRmX3ZlY3Rvcml6ZXIgPSBfX19fX19fX19fX19fX18oc3RvcF93b3Jkcz0nZW5nbGlzaCcsIG1heF9kZj0wLjcpIFxuXG4jIEZpdCBhbmQgdHJhbnNmb3JtIHRoZSB0cmFpbmluZyBkYXRhIFxudGZpZGZfdHJhaW4gPSB0ZmlkZl92ZWN0b3JpemVyLl9fX19fX19fX19fX18oWF90cmFpbikgXG5cbiMgVHJhbnNmb3JtIHRoZSB0ZXN0IHNldCBcbnRmaWRmX3Rlc3QgPSB0ZmlkZl92ZWN0b3JpemVyLl9fX19fX19fXyhYX3Rlc3QpIiwic29sdXRpb24iOiIjIEluaXRpYWxpemUgdGhlIGB0ZmlkZl92ZWN0b3JpemVyYCBcbnRmaWRmX3ZlY3Rvcml6ZXIgPSBUZmlkZlZlY3Rvcml6ZXIoc3RvcF93b3Jkcz0nZW5nbGlzaCcsIG1heF9kZj0wLjcpIFxuXG4jIEZpdCBhbmQgdHJhbnNmb3JtIHRoZSB0cmFpbmluZyBkYXRhIFxudGZpZGZfdHJhaW4gPSB0ZmlkZl92ZWN0b3JpemVyLmZpdF90cmFuc2Zvcm0oWF90cmFpbikgXG5cbiMgVHJhbnNmb3JtIHRoZSB0ZXN0IHNldCBcbnRmaWRmX3Rlc3QgPSB0ZmlkZl92ZWN0b3JpemVyLnRyYW5zZm9ybShYX3Rlc3QpIiwic2N0IjoidGVzdF9vYmplY3QoXCJ0ZmlkZl92ZWN0b3JpemVyXCIsIGRvX2V2YWw9RmFsc2UpXG50ZXN0X29iamVjdChcInRmaWRmX3RyYWluXCIsIGRvX2V2YWw9RmFsc2UpXG50ZXN0X29iamVjdChcInRmaWRmX3Rlc3RcIiwgZG9fZXZhbD1GYWxzZSwgdW5kZWZpbmVkX21zZz1cIkRpZCB5b3UgZGVmaW5lIGBjb3VudF90ZXN0YD9cIilcbnN1Y2Nlc3NfbXNnKFwiV2VsbCBkb25lIVwiKSJ9

Now that you have vectors, you can then take a look at the vector features, stored in count_vectorizer and tfidf_vectorizer.

Are there any noticable issues? (Yes!)

There are clearly comments, measurements or other nonsensical words as well as multilingual articles in the dataset that you have been using. Normally, you would want to spend more time preprocessing this and removing noise, but as this tutorial just showcases a small proof of concept, you will see if the model can overcome the noise and properly classify despite these issues.

eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiaW1wb3J0IHBhbmRhcyBhcyBwZFxuZnJvbSBza2xlYXJuLm1vZGVsX3NlbGVjdGlvbiBpbXBvcnQgdHJhaW5fdGVzdF9zcGxpdFxuZnJvbSBza2xlYXJuLmZlYXR1cmVfZXh0cmFjdGlvbi50ZXh0IGltcG9ydCBUZmlkZlZlY3Rvcml6ZXIsIENvdW50VmVjdG9yaXplciBcbmRmID0gcGQucmVhZF9jc3YoXCJodHRwczovL3MzLmFtYXpvbmF3cy5jb20vYXNzZXRzLmRhdGFjYW1wLmNvbS9ibG9nX2Fzc2V0cy9mYWtlX29yX3JlYWxfbmV3cy5jc3ZcIilcbmRmID0gZGYuc2V0X2luZGV4KCdVbm5hbWVkOiAwJylcbnkgPSBkZi5sYWJlbFxuZGYgPSBkZi5kcm9wKCdsYWJlbCcsIGF4aXM9MSlcblhfdHJhaW4sIFhfdGVzdCwgeV90cmFpbiwgeV90ZXN0ID0gdHJhaW5fdGVzdF9zcGxpdChkZlsndGV4dCddLCB5LCB0ZXN0X3NpemU9MC4zMywgcmFuZG9tX3N0YXRlPTUzKVxudGZpZGZfdmVjdG9yaXplciA9IFRmaWRmVmVjdG9yaXplcihzdG9wX3dvcmRzPSdlbmdsaXNoJywgbWF4X2RmPTAuNylcbnRmaWRmX3RyYWluID0gdGZpZGZfdmVjdG9yaXplci5maXRfdHJhbnNmb3JtKFhfdHJhaW4pXG50ZmlkZl90ZXN0ID0gdGZpZGZfdmVjdG9yaXplci50cmFuc2Zvcm0oWF90ZXN0KVxuY291bnRfdmVjdG9yaXplciA9IENvdW50VmVjdG9yaXplcihzdG9wX3dvcmRzPSdlbmdsaXNoJylcbmNvdW50X3RyYWluID0gY291bnRfdmVjdG9yaXplci5maXRfdHJhbnNmb3JtKFhfdHJhaW4pIFxuY291bnRfdGVzdCA9IGNvdW50X3ZlY3Rvcml6ZXIudHJhbnNmb3JtKFhfdGVzdCkiLCJzYW1wbGUiOiIjIEdldCB0aGUgZmVhdHVyZSBuYW1lcyBvZiBgdGZpZGZfdmVjdG9yaXplcmAgXG5wcmludCh0ZmlkZl92ZWN0b3JpemVyLl9fX19fX19fX19fKClbLTEwOl0pXG5cbiMgR2V0IHRoZSBmZWF0dXJlIG5hbWVzIG9mIGBjb3VudF92ZWN0b3JpemVyYCBcbnByaW50KGNvdW50X3ZlY3Rvcml6ZXIuX19fX19fX19fX18oKVs6MTBdKSIsInNvbHV0aW9uIjoiIyBHZXQgdGhlIGZlYXR1cmUgbmFtZXMgb2YgYHRmaWRmX3ZlY3Rvcml6ZXJgIFxucHJpbnQodGZpZGZfdmVjdG9yaXplci5nZXRfZmVhdHVyZV9uYW1lcygpWy0xMDpdKVxuXG4jIEdldCB0aGUgZmVhdHVyZSBuYW1lcyBvZiBgY291bnRfdmVjdG9yaXplcmAgXG5wcmludChjb3VudF92ZWN0b3JpemVyLmdldF9mZWF0dXJlX25hbWVzKClbOjEwXSkiLCJzY3QiOiJ0ZXN0X2Z1bmN0aW9uKFwicHJpbnRcIiwgMSlcbnRlc3RfZnVuY3Rpb24oXCJwcmludFwiLCAyKVxuc3VjY2Vzc19tc2coXCJXZWxsIGRvbmUhXCIpIn0=

Intermezzo: Count versus TF-IDF Features

I was curious if my count and TF-IDF vectorizers had extracted different tokens. To take a look and compare features, you can extract the vector information back into a DataFrame to use easy Python comparisons.

As you can see by running the cells below, both vectorizers extracted the same tokens, but obviously have different weights. Likely, changing the max_df and min_df of the TF-IDF vectorizer could alter the result and lead to different features in each.

In [15]:
 count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())
In [16]:
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
In [17]:
difference = set(count_df.columns) - set(tfidf_df.columns)difference
Out[17]:
set()
In [18]:
print(count_df.equals(tfidf_df))
False
In [19]:
count_df.head()
Out[19]:
 000000000000000310000350000600010001pt000ft000km...حلبعربيعنلممامحاولاتمنهذاوالمرضىยงade
00000000000...0000000000
10000000000...0000000000
20000000000...0000000000
30000000000...0000000000
40000000000...0000000000

5 rows × 56922 columns

In [20]:
tfidf_df.head()
Out[20]:
 000000000000000310000350000600010001pt000ft000km...حلبعربيعنلممامحاولاتمنهذاوالمرضىยงade
00.00.00.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
10.00.00.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
20.00.00.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
30.00.00.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0
40.00.00.00.00.00.00.00.00.00.0...0.00.00.00.00.00.00.00.00.00.0

5 rows × 56922 columns

Comparing Models

Now it's time to train and test your models.

Here, you'll begin with an NLP favorite, MultinomialNB. You can use this to compare TF-IDF versus bag-of-words. My intuition was that bag-of-words (aka CountVectorizer) would perform better with this model. (For more reading on multinomial distribution and why it works best with integers, check out this fairly succinct explanation from a UPenn statistics course).

I personally find Confusion Matrices easier to compare and read, so I used the scikit-learn documentation to build some easily-readable confusion matrices (thanks open source!). A confusion matrix shows the proper labels on the main diagonal (top left to bottom right). The other cells show the incorrect labels, often referred to as false positives or false negatives. Depending on your problem, one of these might be more significant. For example, for the fake news problem, is it more important that we don't label real news articles as fake news? If so, we might want to eventually weight our accuracy score to better reflect this concern.

Other than Confusion Matrices, scikit-learn comes with many ways to visualize and compare your models. One popular way is to use a ROC Curve. There are many other ways to evaluate your model available in the scikit-learn metrics module.

In [21]:
def plot_confusion_matrix(cm, classes,                          normalize=False,                          title='Confusion matrix',                          cmap=plt.cm.Blues):    """    See full source and example:     http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html        This function prints and plots the confusion matrix.    Normalization can be applied by setting `normalize=True`.    """    plt.imshow(cm, interpolation='nearest', cmap=cmap)    plt.title(title)    plt.colorbar()    tick_marks = np.arange(len(classes))    plt.xticks(tick_marks, classes, rotation=45)    plt.yticks(tick_marks, classes)    if normalize:        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]        print("Normalized confusion matrix")    else:        print('Confusion matrix, without normalization')    thresh = cm.max() / 2.    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):        plt.text(j, i, cm[i, j],                 horizontalalignment="center",                 color="white" if cm[i, j] > thresh else "black")    plt.tight_layout()    plt.ylabel('True label')    plt.xlabel('Predicted label')
In [22]:
clf = MultinomialNB()
In [23]:
 clf.fit(tfidf_train, y_train)pred = clf.predict(tfidf_test)score = metrics.accuracy_score(y_test, pred)print("accuracy:   %0.3f" % score)cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
accuracy:   0.857Confusion matrix, without normalization
 
confusion matrix 1
In [24]:
clf = MultinomialNB() 
In [25]:
clf.fit(count_train, y_train)pred = clf.predict(count_test)score = metrics.accuracy_score(y_test, pred)print("accuracy:   %0.3f" % score)cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
accuracy:   0.893Confusion matrix, without normalization
 
confusion matrix 2
 

And indeed, with absolutely no parameter tuning, your count vectorized training set count_train is visibly outperforming your TF-IDF vectors!

Testing Linear Models

There are a lot of great write-ups about how linear models work well with TF-IDF vectorizers (take a look at word2vec for classification, SVM reference in scikit-learn text analysis, and many more).

So you should use a SVM, right?

Well, I recently watched Victor Lavrenko's lecture on text classification and he compares Passive Aggressive classifiers to linear SVMs for text classification. We'll test this approach (which has some significant speed benefits and permanent learning disadvantages) with the fake news dataset.

In [26]:
linear_clf = PassiveAggressiveClassifier(n_iter=50)
In [27]:
 linear_clf.fit(tfidf_train, y_train)pred = linear_clf.predict(tfidf_test)score = metrics.accuracy_score(y_test, pred)print("accuracy:   %0.3f" % score)cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
accuracy:   0.936Confusion matrix, without normalization
confusion matrix 3

Wow!

I'm impressed. The confusion matrix looks different and the model classifies our fake news a bit better. We can test if tuning the alpha value for a MultinomialNB creates comparable results. You can also use parameter tuning with grid search for a more exhaustive search.

In [28]:
clf = MultinomialNB(alpha=0.1)
In [29]:
 last_score = 0for alpha in np.arange(0,1,.1):    nb_classifier = MultinomialNB(alpha=alpha)    nb_classifier.fit(tfidf_train, y_train)    pred = nb_classifier.predict(tfidf_test)    score = metrics.accuracy_score(y_test, pred)    if score > last_score:        clf = nb_classifier    print("Alpha: {:.2f} Score: {:.5f}".format(alpha, score))
Alpha: 0.00 Score: 0.61502Alpha: 0.10 Score: 0.89766Alpha: 0.20 Score: 0.89383Alpha: 0.30 Score: 0.89000Alpha: 0.40 Score: 0.88570Alpha: 0.50 Score: 0.88427Alpha: 0.60 Score: 0.87470Alpha: 0.70 Score: 0.87040Alpha: 0.80 Score: 0.86609Alpha: 0.90 Score: 0.85892

Not quite... At this point, it might be interesting to perform parameter tuning across all of the classifiers, or take a look at some other scikit-learn Bayesian classifiers. You could also test with a Support Vector Machine (SVM) to see if that outperforms the Passive Aggressive classifier.

But I am a bit more curious about what the Passive Aggressive model actually has learned. So let's move onto introspection.

Introspecting models

So fake news is solved, right? We achieved 93% accuracy on my dataset so let's all close up shop and go home.

Not quite, of course. I am wary at best of these results given how much noise we saw in the features. There is a great write-up on StackOverflow with this incredibly useful function for finding vectors that most affect labels. It only works for binary classificaiton (classifiers with 2 classes), but that's good news for you, since you only have FAKE or REAL labels.

Using your best performing classifier with your TF-IDF vector dataset (tfidf_vectorizer) and Passive Aggressive classifier (linear_clf), inspect the top 30 vectors for fake and real news:

In [30]:
FAKE -4.86382369883 2016FAKE -4.13847157932 hillaryFAKE -3.98994974843 octoberFAKE -3.10552662226 shareFAKE -2.99713810694 novemberFAKE -2.9150746075 articleFAKE -2.54532100449 printFAKE -2.47115243995 advertisementFAKE -2.35915304509 sourceFAKE -2.31585837413 emailFAKE -2.27985826579 electionFAKE -2.2736680857 octFAKE -2.25253568246 warFAKE -2.19663276969 mosulFAKE -2.17921304122 podestaFAKE -1.99361009573 novFAKE -1.98662624907 comFAKE -1.9452527887 establishmentFAKE -1.86869495684 corporateFAKE -1.84166664376 wikileaksFAKE -1.7936566878 26FAKE -1.75686475396 donaldFAKE -1.74951154055 snipFAKE -1.73298170472 mainstreamFAKE -1.71365596627 ukFAKE -1.70917804969 ayotteFAKE -1.70781651904 entireFAKE -1.68272667818 jewishFAKE -1.65334397724 youtubeFAKE -1.6241703128 pipelineREAL 4.78064061698 saidREAL 2.68703967567 tuesdayREAL 2.48309800829 gopREAL 2.45710670245 islamicREAL 2.44326123901 saysREAL 2.29424417889 cruzREAL 2.29144842597 marriageREAL 2.20500735471 candidatesREAL 2.19136552672 conservativeREAL 2.18030834903 mondayREAL 2.05688105375 attacksREAL 2.03476457362 rushREAL 1.9954523319 continueREAL 1.97002430576 fridayREAL 1.95034103105 conventionREAL 1.94620720989 senREAL 1.91185661202 jobsREAL 1.87501303774 debateREAL 1.84059602241 presumptiveREAL 1.80111133252 sayREAL 1.80027216061 sundayREAL 1.79650823765 marchREAL 1.79229792108 parisREAL 1.74587899553 securityREAL 1.69585506276 conservativesREAL 1.68860503431 recountsREAL 1.67424302821 dealREAL 1.67343398121 campaignREAL 1.66148582079 foxREAL 1.61425630518 attack

You can also do this in a pretty obvious way with only a few lines of Python, by zipping your coefficients to your features and taking a look at the top and bottom of your list.

In [31]:
feature_names = tfidf_vectorizer.get_feature_names()
In [32]:
### Most realsorted(zip(clf.coef_[0], feature_names), reverse=True)[:20]
Out[32]:
[(-6.2573612147015822, 'trump'), (-6.4944530943126777, 'said'), (-6.6539784739838845, 'clinton'), (-7.0379446628670728, 'obama'), (-7.1465399833812278, 'sanders'), (-7.2153760086475112, 'president'), (-7.2665628057416169, 'campaign'), (-7.2875931446681514, 'republican'), (-7.3411184585990643, 'state'), (-7.3413571102479054, 'cruz'), (-7.3783124419854254, 'party'), (-7.4468806724578904, 'new'), (-7.4762888011545883, 'people'), (-7.547225599514773, 'percent'), (-7.5553074094582335, 'bush'), (-7.5801506339098932, 'republicans'), (-7.5855405012652435, 'house'), (-7.6344781725203141, 'voters'), (-7.6484824436952987, 'rubio'), (-7.6734836186463795, 'states')]
In [33]:
### Most fakesorted(zip(clf.coef_[0], feature_names))[:20]
Out[33]:
[(-11.349866225220305, '0000'), (-11.349866225220305, '000035'), (-11.349866225220305, '0001'), (-11.349866225220305, '0001pt'), (-11.349866225220305, '000km'), (-11.349866225220305, '0011'), (-11.349866225220305, '006s'), (-11.349866225220305, '007'), (-11.349866225220305, '007s'), (-11.349866225220305, '008s'), (-11.349866225220305, '0099'), (-11.349866225220305, '00am'), (-11.349866225220305, '00p'), (-11.349866225220305, '00pm'), (-11.349866225220305, '014'), (-11.349866225220305, '015'), (-11.349866225220305, '018'), (-11.349866225220305, '01am'), (-11.349866225220305, '020'), (-11.349866225220305, '023')]

So, clearly there are certain words which might show political intent and source in the top fake features (such as the words corporate and establishment).

Also, the real news data uses forms of the verb "to say" more often, likely because in newspapers and most journalistic publications sources are quoted directly ("German Chancellor Angela Merkel said...").

To extract the full list from your current classifier and take a look at each token (or easily compare tokens from classifier to classifier), you can easily export it like so.

In [34]:
tokens_with_weights = sorted(list(zip(feature_names, clf.coef_[0])))

Intermezzo: HashingVectorizer

Another vectorizer used sometimes for text classification is a HashingVectorizer. HashingVectorizers require less memory and are faster (because they are sparse and use hashes rather than tokens) but are more difficult to introspect. You can read a bit more about the pros and cons of using HashingVectorizer in the scikit-learn documentation if you are interested.

You can give it a try and compare its results versus the other vectorizers. It performs fairly well, with better results than the TF-IDF vectorizer using MultinomialNB (this is somewhat expected due to the same reasons CountVectorizers perform better), but not as well as the TF-IDF vectorizer with Passive Aggressive linear algorithm.

In [35]:
hash_vectorizer = HashingVectorizer(stop_words='english', non_negative=True)hash_train = hash_vectorizer.fit_transform(X_train)hash_test = hash_vectorizer.transform(X_test)
In [36]:
clf = MultinomialNB(alpha=.01)
In [37]:
clf.fit(hash_train, y_train)pred = clf.predict(hash_test)score = metrics.accuracy_score(y_test, pred)print("accuracy:   %0.3f" % score)cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
accuracy:   0.902Confusion matrix, without normalization
 
confusion matrix 4
In [38]:
clf = PassiveAggressiveClassifier(n_iter=50)
In [39]:
clf.fit(hash_train, y_train)pred = clf.predict(hash_test)score = metrics.accuracy_score(y_test, pred)print("accuracy:   %0.3f" % score)cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])
accuracy:   0.921Confusion matrix, without normalization
confusion matrix 5

Conclusion

So was your fake news classifier experiment a success? Definitely not.

But you did get to play around with a new dataset, test out some NLP classification models and introspect how successful they were? Yes.

As expected from the outset, defining fake news with simple bag-of-words or TF-IDF vectors is an oversimplified approach. Especially with a multilingual dataset full of noisy tokens. If you hadn't taken a look at what your model had actually learned, you might have thought the model learned something meaningful. So, remember: always introspect your models (as best you can!).

I would be curious if you find other trends in the data I might have missed. I will be following up with a post on how different classifiers compare in terms of important features on my blog. If you spend some time researching and find anything interesting, feel free to share your findings and notes in the comments or you can always reach out on Twitter (I'm @kjam).

I hope you had some fun exploring a new NLP dataset with me!

Take a look at DataCamp's Python Machine Learning: Scikit-Learn Tutorial.

Topics

Machine Learning Courses

Course

Supervised Learning with scikit-learn

4 hr
103.3K
Grow your machine learning skills with scikit-learn in Python. Use real-world datasets in this interactive course and learn how to make powerful predictions!
See DetailsRight Arrow
Start Course
See MoreRight Arrow
Related

How to Learn Machine Learning in 2024

Discover how to learn machine learning in 2024, including the key skills and technologies you’ll need to master, as well as resources to help you get started.
Adel Nehme's photo

Adel Nehme

15 min

Test-Driven Development in Python: A Beginner's Guide

Dive into test-driven development (TDD) with our comprehensive Python tutorial. Learn how to write robust tests before coding with practical examples.
Amina Edmunds's photo

Amina Edmunds

7 min

Exponents in Python: A Comprehensive Guide for Beginners

Master exponents in Python using various methods, from built-in functions to powerful libraries like NumPy, and leverage them in real-world scenarios to gain a deeper understanding.
Satyam Tripathi's photo

Satyam Tripathi

9 min

OpenCV Tutorial: Unlock the Power of Visual Data Processing

This article provides a comprehensive guide on utilizing the OpenCV library for image and video processing within a Python environment. We dive into the wide range of image processing functionalities OpenCV offers, from basic techniques to more advanced applications.
Richmond Alake's photo

Richmond Alake

13 min

Python Linked Lists: Tutorial With Examples

Learn everything you need to know about linked lists: when to use them, their types, and implementation in Python.
Natassha Selvaraj's photo

Natassha Selvaraj

9 min

An Introduction to the Mamba LLM Architecture: A New Paradigm in Machine Learning

Discover the power of Mamba LLM, a transformative architecture from leading universities, redefining sequence processing in AI.
Kurtis Pykes 's photo

Kurtis Pykes

9 min

See MoreSee More