In [1]:
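import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

# Render figures inline in the notebook.
%matplotlib inline
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'.
matplotlib.style.use('seaborn-whitegrid')

# Load the cleaned archive and index it by tweet time for the time-series plots.
df = pd.read_csv('twitter_archive_master.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)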

Retweets, Favorites and Ratings Correlation

In [2]:
df[['favorites', 'retweet_count']].plot(style = '.', alpha = .2)
plt.title('Favorites and Retweets over Time')
plt.xlabel('Date')
plt.ylabel('Count')
Out[2]:
<matplotlib.text.Text at 0x1abf1ce39e8>

Both favorites and retweets climb gradually over time.

In [3]:
df.plot(y ='rating', ylim=[0,14], style = '.', alpha = .2)
plt.title('Rating Increase over Time')
plt.xlabel('Date')
plt.ylabel('Rating')
Out[3]:
<matplotlib.text.Text at 0x1abf1e61630>

So Bront was right, the ratings are getting higher and higher. Whether the dogs are getting better I'm not sure.
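
One way to back up the scatter plot with a number is to summarise the ratings by calendar month. The snippet below is only a minimal sketch on a made-up toy frame: `toy` and `monthly_median` are illustrative names, and the values are not taken from the archive. With the real data, the same `groupby` could be applied to `df` while it is still indexed by timestamp.

```python
import pandas as pd

# Toy stand-in for the archive: a datetime index and a 'rating' column.
# These values are invented for illustration only.
toy = pd.DataFrame(
    {'rating': [8, 9, 10, 10, 11, 12, 12, 13]},
    index=pd.to_datetime([
        '2015-11-20', '2015-11-28', '2016-03-02', '2016-03-15',
        '2016-09-01', '2016-09-20', '2017-02-03', '2017-02-25',
    ]),
)

# Group tweets by calendar month and take the median rating in each month.
# The median is used so that a single extreme rating cannot dominate a month.
monthly_median = toy.groupby(toy.index.to_period('M'))['rating'].median()
print(monthly_median)
```

If the monthly medians rise steadily, that supports the impression from the plot without relying on how dense the dots look.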

In [4]:
df[['favorites', 'retweet_count', 'rating']].corr(method='pearson')
Out[4]:
               favorites  retweet_count    rating
favorites       1.000000       0.914929  0.023167
retweet_count   0.914929       1.000000  0.023733
rating          0.023167       0.023733  1.000000

I ran a Pearson correlation to see whether dogs with higher ratings also get more favorites and retweets. In my mind, if the dogs really are getting better, favorites and retweets should rise along with the ratings. Favorites and retweets are strongly correlated with each other (r ≈ 0.91), which makes me think that a tweet that is good in general collects more of both.

Yet there is almost no correlation between rating and either retweets or favorites (r ≈ 0.02). There are a few possible explanations. One is that the dogs are not actually getting better. Another is that the 'lower quality' dogs are given funnier captions, in which case it is the caption that earns the retweets and favorites, not the dog itself.
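
One caveat: the correlation in cell 4 was run on the full frame, which still contains the joke ratings of 420 and 1776, and a Pearson coefficient can be distorted by a couple of extreme values. The sketch below illustrates that effect on invented toy data (the `toy` frame and all variable names are assumptions for illustration, not results from the archive). It compares Pearson on all rows, Pearson after dropping ratings above 14, and the rank-based Spearman coefficient.

```python
import pandas as pd

# Invented toy data: ratings rise with favorites, plus two joke ratings.
toy = pd.DataFrame({
    'favorites': [100, 200, 300, 400, 500, 600, 700, 800, 150, 250],
    'rating':    [8,   9,   10,  10,  11,  12,  12,  13,  1776, 420],
})

# Pearson on every row, joke ratings included.
pearson_all = toy.corr(method='pearson').loc['favorites', 'rating']

# Pearson after excluding ratings above 14 (the same cutoff used later on).
trimmed = toy[toy['rating'] <= 14]
pearson_trimmed = trimmed.corr(method='pearson').loc['favorites', 'rating']

# Spearman works on ranks, so extreme values have far less leverage.
spearman_all = toy.corr(method='spearman').loc['favorites', 'rating']

print(pearson_all, pearson_trimmed, spearman_all)
```

On the real data it would be worth rerunning cell 4 with the same `rating <= 14` filter before reading too much into the near-zero coefficient.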

Good Boys and Good Girls

In [5]:
# value_counts() skips missing values by default, so no explicit filter is needed.
df['gender'].value_counts().plot(kind='pie')
plt.title('Dog Genders')
Out[5]:
<matplotlib.text.Text at 0x1abf1f6f080>

Three times as many dogs were identified as boys as were identified as girls. Based on what I know about biology, this is unlikely to be representative of the dog population at large. More likely, male dogs simply have their sex announced more often, by their owner, by @dog_rates, or both.

I also remember one instance of the word 'girl' that referred to a human girl who was in the picture as well, so it is possible that some of the dogs were misgendered by my function. It's also possible that @dog_rates misgenders some females by using male pronouns as gender-neutral, which is somewhat common. However, since only a small portion of the rated dogs were gendered at all, it is hard to draw concrete conclusions.
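
To make that failure mode concrete, here is a hypothetical sketch of a keyword-based gender tagger. It is not the function used to build this dataset: `infer_gender`, the word lists, and the rule of returning `None` for mixed signals are all assumptions for illustration. Matching on whole words avoids hits inside words like 'the' or 'here', and refusing to guess when both male and female words appear is one cautious way to handle captions that also mention a human.

```python
import re

# Hypothetical word lists; whole-word, case-insensitive matching.
MALE_WORDS = re.compile(r"\b(he|him|his|boy)\b", re.IGNORECASE)
FEMALE_WORDS = re.compile(r"\b(she|her|hers|girl)\b", re.IGNORECASE)

def infer_gender(text):
    """Return 'male', 'female', or None when the text is silent or ambiguous."""
    has_male = bool(MALE_WORDS.search(text))
    has_female = bool(FEMALE_WORDS.search(text))
    if has_male == has_female:
        # Either no gendered words at all, or both kinds: do not guess.
        return None
    return 'male' if has_male else 'female'

print(infer_gender("This is Bella. She is a very good girl."))
print(infer_gender("This is Max. He is hugging the girl next door."))
print(infer_gender("Here is the fluffiest pupper of the week."))
```

A rule like this trades coverage for accuracy: fewer dogs get a label, but a caption that mentions a human girl next to a male dog is left untagged instead of being mislabeled.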

Most Rated Breeds

In [6]:
top_breeds=df.groupby('breed').filter(lambda x: len(x) >= 20)
top_breeds['breed'].value_counts().plot(kind = 'barh')
plt.title('Most Rated Breeds')
plt.xlabel('Count')
plt.ylabel('Breed')
Out[6]:
<matplotlib.text.Text at 0x1abf1fcbcf8>

It's difficult to know why these breeds come out on top. They could simply be commonly owned. They could be favorites of the owner of @dog_rates, who therefore chooses to rate them more often. Or they could be the easiest breeds for the AI to identify. Anecdotally, I have only ever owned the top two breeds.

In [7]:
top_breeds.groupby('breed')['rating'].describe()
Out[7]:
                           count       mean       std  min    25%    50%   75%   max
breed
Cardigan                    21.0  11.142857  1.590148  7.0  10.00  11.00  12.0  13.0
Chesapeake_Bay_retriever    31.0  10.741935  1.510358  8.0  10.00  10.00  12.0  13.0
Chihuahua                   91.0  10.516484  2.071568  3.0   9.50  11.00  12.0  14.0
Eskimo_dog                  22.0  11.409091  1.402688  9.0  10.00  12.00  12.0  14.0
French_bulldog              31.0  11.193548  1.796652  8.0  10.00  12.00  12.0  14.0
German_shepherd             21.0  11.000000  1.449138  8.0  10.00  11.00  12.0  13.0
Labrador_retriever         108.0  11.180556  1.324567  7.0  10.00  11.00  12.0  14.0
Pembroke                    95.0  11.389474  1.746088  4.0  11.00  12.00  12.0  14.0
Pomeranian                  42.0  10.779762  1.619435  6.0  10.00  11.00  12.0  14.0
Samoyed                     42.0  11.690476  1.352290  7.0  11.00  12.00  13.0  14.0
Shih-Tzu                    20.0  10.350000  1.308877  8.0   9.75  10.50  11.0  13.0
Siberian_husky              20.0  11.025000  1.720427  5.5  10.00  11.00  12.0  13.0
Staffordshire_bullterrier   21.0  10.761905  1.374946  8.0  10.00  11.00  12.0  13.0
beagle                      20.0  10.150000  1.531253  6.0   9.75  10.00  11.0  13.0
chow                        48.0  11.416667  1.350072  7.0  11.00  12.00  12.0  13.0
cocker_spaniel              30.0  11.316667  1.177983  9.0  10.25  11.25  12.0  13.0
golden_retriever           157.0  11.624204  1.188427  8.0  11.00  12.00  12.0  14.0
malamute                    33.0  10.878788  1.218544  8.0  10.00  11.00  12.0  13.0
miniature_pinscher          25.0  10.000000  2.565801  2.0   9.00  11.00  12.0  12.0
pug                         62.0  10.241935  1.816910  3.0   9.25  10.00  11.0  13.0
toy_poodle                  51.0  11.039216  1.264291  7.0  10.00  11.00  12.0  13.0

In [8]:
df['rating'].describe()
Out[8]:
count    1991.000000
mean       11.647638
std        40.668547
min         1.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating, dtype: float64

Here we have the summary statistics for the top breeds next to the summary statistics for all the ratings. Only one of the top breeds (Samoyed, 11.69) has a mean above the overall mean of 11.65. That is probably because the joke ratings of 420 and 1776 pull the overall mean up.

In [9]:
df[df['rating'] <= 14]['rating'].describe()
Out[9]:
count    1989.000000
mean       10.555277
std         2.157977
min         1.000000
25%        10.000000
50%        11.000000
75%        12.000000
max        14.000000
Name: rating, dtype: float64

So I excluded the joke ratings, and the overall mean drops to 10.555. Only five of the 21 top breeds have means below this adjusted overall mean, so most of the top breeds are rated higher than average.
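
Counting by eye which breeds fall below the overall mean is error-prone, so the comparison can also be done in code. This is a minimal sketch on invented toy data (`toy`, `typical`, `breed_means` and `below_average` are illustrative names, and the numbers are not from the archive); on the real data, `top_breeds` would supply the breed means and the filtered `df` the overall mean.

```python
import pandas as pd

# Invented toy data: three breeds, two unlabeled dogs, one joke rating.
toy = pd.DataFrame({
    'breed': ['pug', 'pug', 'beagle', 'beagle',
              'Samoyed', 'Samoyed', None, None],
    'rating': [9, 10, 10, 10, 12, 13, 11, 1776],
})

# Drop joke ratings first so they cannot inflate the overall mean.
typical = toy[toy['rating'] <= 14]
overall_mean = typical['rating'].mean()

# Mean rating per breed; rows without a breed are ignored by groupby.
breed_means = typical.groupby('breed')['rating'].mean()

# Breeds whose mean rating falls below the overall mean.
below_average = breed_means[breed_means < overall_mean]
print(round(overall_mean, 3))
print(below_average)
```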

Dog Stages Stats

In [10]:
df.boxplot(column=['rating'], by=['dog_type'])
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1abf205ea20>
In [11]:
df.groupby('dog_type')['rating'].describe()
Out[11]:
          count       mean       std   min   25%   50%   75%   max
dog_type
doggo      69.0  11.797101  1.510548   8.0  11.0  12.0  13.0  14.0
floofer    34.0  11.705882  0.759961  10.0  11.0  12.0  12.0  13.0
pupper    237.0  10.616160  1.833623   3.0  10.0  11.0  12.0  14.0
puppo      29.0  12.172414  1.197288   9.0  12.0  13.0  13.0  14.0

So puppers are getting noticeably lower ratings than the other dog types. Their median is lower and they have several low outliers. This makes sense, since 'pupper' can be used to describe irresponsible dogs.

Floofers are consistently rated 10 or higher. I wonder whether that is because they are always awesome or whether it is an effect of time. We know that the ratings have been getting higher. If 'floof' is a newer term, used only in newer tweets, that might explain the consistently higher ratings.

In [12]:
df.reset_index(inplace=True)
df.groupby('dog_type')['timestamp'].describe()
Out[12]:
          count  unique                  top  freq                first                 last
dog_type
doggo        69      69  2016-11-18 23:35:32     1  2016-04-02 01:52:38  2017-07-26 15:59:51
floofer      34      34  2016-07-05 20:41:01     1  2016-01-08 03:50:03  2017-07-18 00:07:08
pupper      237     237  2016-01-30 02:41:58     1  2015-11-26 21:36:12  2017-07-15 23:25:31
puppo        29      29  2017-01-29 02:44:34     1  2016-06-03 01:07:16  2017-07-25 01:55:32

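Here we see that 'floof' is not a new term: the first floofer tweet dates from January 2016, less than two months after the earliest pupper tweet (November 2015). Timing therefore does not explain the high ratings, which suggests that the floofers are simply consistently great dogs.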
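
A more direct check of the timing question would be to compare floofers only with other dogs rated in the same period. The following is a rough sketch on invented toy data (the `toy` frame and names such as `first_floofer` and `same_window` are assumptions, not values from the archive). It keeps every tweet from the first floofer tweet onward and compares the median rating of floofers with the median of everything else in that window.

```python
import pandas as pd

# Invented toy data with the same three columns used in this section.
toy = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2015-12-01', '2015-12-15', '2016-01-08', '2016-06-01',
        '2016-09-01', '2017-03-01', '2017-06-01',
    ]),
    'dog_type': [None, 'pupper', 'floofer', None,
                 'pupper', 'floofer', None],
    'rating': [8, 9, 11, 10, 11, 12, 12],
})

# Date of the earliest tweet tagged as a floofer.
first_floofer = toy.loc[toy['dog_type'] == 'floofer', 'timestamp'].min()

# Restrict the comparison to tweets posted on or after that date.
same_window = toy[toy['timestamp'] >= first_floofer]
is_floofer = same_window['dog_type'] == 'floofer'

floofer_median = same_window.loc[is_floofer, 'rating'].median()
other_median = same_window.loc[~is_floofer, 'rating'].median()
print(first_floofer.date(), floofer_median, other_median)
```

If floofers still come out ahead within the shared window, the general upward drift in ratings is less likely to be the whole story.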