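%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt

# Note: on Matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'.
matplotlib.style.use('seaborn-whitegrid')

# Load the cleaned archive and index it by tweet time so plots get a date axis.
df = pd.read_csv('twitter_archive_master.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)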

Retweets, Favorites and Ratings Correlation

df[['favorites', 'retweet_count']].plot(style='.', alpha=.2)
plt.title('Favorites and Retweets over Time')
plt.xlabel('Date')
plt.ylabel('Count')


Here you can see that both favorites and retweets rise gradually over time.

df.plot(y='rating', ylim=[0, 14], style='.', alpha=.2)
plt.title('Rating Increase over Time')
plt.xlabel('Date')
plt.ylabel('Rating')


So Bront was right: the ratings are getting higher and higher. Whether the dogs are actually getting better, I'm not sure.
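One way to check the trend without relying on the scatter plot is to take the median rating per calendar month. The following is only a minimal sketch: `monthly_median_rating` is a hypothetical helper, the cap of 14 is assumed from the plot's y-limit, and the small frame is made-up data that only demonstrates the call.

```python
import pandas as pd

def monthly_median_rating(frame, max_rating=14):
    """Median rating per calendar month, ignoring ratings above max_rating.

    Assumes `frame` has a DatetimeIndex and a numeric 'rating' column.
    """
    capped = frame[frame['rating'] <= max_rating]
    # Group by year-month period so each month yields a single value.
    return capped.groupby(capped.index.to_period('M'))['rating'].median()

# Made-up toy data, used only to show how the helper is called.
toy = pd.DataFrame(
    {'rating': [8, 10, 11, 12, 13, 1776]},
    index=pd.to_datetime(['2015-11-20', '2015-11-25', '2016-06-01',
                          '2016-06-15', '2017-07-01', '2017-07-04']),
)
trend = monthly_median_rating(toy)
print(trend)
```

On the real data this would be called as `monthly_median_rating(df)` while `timestamp` is still the index. A rising series would back up what the plot suggests.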

df[['favorites', 'retweet_count', 'rating']].corr(method='pearson')

So I ran a correlation to see whether dogs with higher ratings get more favorites and retweets. In my mind, if the dogs are getting better, favorites and retweets should rise along with the ratings. Favorites and retweets are strongly correlated with each other (r ≈ 0.91). This makes me think that a tweet that is good in general draws more of both.

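Yet there is almost no linear correlation between rating and either retweets or favorites (r ≈ 0.02 for both). There are a few possible explanations. One is that the dogs are not actually getting better. Another is that the 'lower quality' dogs are given funnier captions, so it is the caption that earns the retweets and favorites, not the dog itself. A third is that the joke ratings of 420 and 1776 are included in this calculation, and Pearson correlation is sensitive to extreme values like these.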
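Because the `rating` column includes a couple of very large joke ratings (they show up in the `describe()` output under Most Rated Breeds), it may be worth recomputing the correlation without them and with a rank-based method as well. This is a sketch under assumptions, not the analysis run above: `rating_correlations` is a hypothetical helper, the cap of 14 is my assumption, and the toy frame is made-up data.

```python
import pandas as pd

def rating_correlations(frame, max_rating=14):
    """Correlation of 'rating' with the engagement columns, using only
    rows whose rating is at most max_rating."""
    cols = ['favorites', 'retweet_count', 'rating']
    capped = frame.loc[frame['rating'] <= max_rating, cols]
    return {
        # Pearson measures linear association.
        'pearson': capped.corr(method='pearson')['rating'],
        # Spearman works on ranks, so it is less affected by outliers.
        'spearman': capped.corr(method='spearman')['rating'],
    }

# Made-up toy data: one extreme rating attached to a low-engagement tweet.
toy = pd.DataFrame({
    'favorites':     [100, 200, 300, 400, 500, 50],
    'retweet_count': [10, 20, 30, 40, 50, 5],
    'rating':        [9, 10, 11, 12, 13, 1776],
})
result = rating_correlations(toy)
print(result['pearson'])
print(result['spearman'])
```

In the toy frame a single extreme rating is enough to flip the sign of the uncapped Pearson coefficient. Whether the real data behaves similarly would have to be checked by running `rating_correlations(df)`.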

Good Boys and Good Girls

# value_counts() already skips missing values, so no explicit notnull() filter is needed.
df['gender'].value_counts().plot(kind='pie')
plt.title('Dog Genders')


Three times as many dogs were identified as boys as were identified as girls. Based on what I know about biology, this is likely not representative of the dog population at large. More likely, male dogs have their sex announced more often, either by their owner or by @dog_rates or both.

I also remember one instance of the word 'girl' that referred to a human girl who was also in the picture, so it is possible that some of the dogs were misgendered by my function. It's also possible that @dog_rates is misgendering some females by using male pronouns as gender-neutral, which is somewhat common. However, since only a small portion of the rated dogs were gendered at all, it is hard to draw concrete conclusions.
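To put a number on "a small portion", the share of rows with any gender label can be computed alongside the split among the labelled rows. This is a rough sketch: `gender_coverage` is a hypothetical helper, and the label values 'male' and 'female' in the toy frame are assumptions, since the actual values in the `gender` column are not shown here.

```python
import pandas as pd

def gender_coverage(frame):
    """Return (fraction of rows with a gender label,
    normalized split among the labelled rows)."""
    labelled = frame['gender'].notnull().mean()
    # value_counts ignores missing values by default.
    split = frame['gender'].value_counts(normalize=True)
    return labelled, split

# Made-up toy data; the label strings are assumed for illustration.
toy = pd.DataFrame(
    {'gender': ['male', 'male', 'male', 'female', None, None, None, None]}
)
labelled, split = gender_coverage(toy)
print(labelled)
print(split)
```

A low labelled fraction on the real data would support reading the pie chart cautiously.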

Most Rated Breeds

# Keep only breeds that were rated at least 20 times.
top_breeds = df.groupby('breed').filter(lambda x: len(x) >= 20)
top_breeds['breed'].value_counts().plot(kind='barh')
plt.title('Counts of the Most Rated Breeds')
plt.xlabel('Count')
plt.ylabel('Breed')


It's difficult to know why these breeds are the top breeds. It could be because they are commonly owned. They could be favorites of the owner of @dog_rates and so get chosen for rating most often. Or they could be the easiest for the AI that identified them to recognize. Anecdotally, I have only ever owned the top two breeds.

top_breeds.groupby('breed')['rating'].describe()

df['rating'].describe()

count    1991.000000
mean       11.647638
std        40.668547
min         1.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating, dtype: float64

Here we have a statistical description of the top breeds compared to the statistical description of all the ratings. Only one of the top breeds, the Samoyed (11.69), has a mean higher than the total population mean (11.65). That might be because the joke ratings of 420 and 1776 pull up the total population mean.

df[df['rating'] <= 14]['rating'].describe()

count    1989.000000
mean       10.555277
std         2.157977
min         1.000000
25%        10.000000
50%        11.000000
75%        12.000000
max        14.000000
Name: rating, dtype: float64

So I adjusted the ratings to exclude the joke ratings, and now the mean is 10.555. Only five of the top 21 breeds have means under this adjusted population average, so most of the top breeds are rated higher than average.
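The count of breeds falling under the adjusted average can also be computed directly instead of being read off the table. The sketch below assumes the same thresholds used above (at least 20 ratings per breed, ratings capped at 14); `breeds_below_overall_mean` is a hypothetical helper and the toy frame is made-up data with a lower `min_count` so it stays small.

```python
import pandas as pd

def breeds_below_overall_mean(frame, min_count=20, max_rating=14):
    """Mean rating of frequently rated breeds whose mean is below the
    overall mean, with both computed on ratings <= max_rating."""
    capped = frame[frame['rating'] <= max_rating]
    overall_mean = capped['rating'].mean()
    stats = capped.groupby('breed')['rating'].agg(['count', 'mean'])
    frequent = stats[stats['count'] >= min_count]
    return frequent.loc[frequent['mean'] < overall_mean, 'mean'].sort_values()

# Made-up toy data: breed 'c' is too rare, and the 420 rating is excluded.
toy = pd.DataFrame({
    'breed':  ['a', 'a', 'a', 'b', 'b', 'b', 'c', None],
    'rating': [10, 10, 10, 12, 12, 12, 5, 420],
})
below = breeds_below_overall_mean(toy, min_count=3)
print(below)
```

On the real data, `len(breeds_below_overall_mean(df))` should reproduce the count given above if the thresholds match.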

Dog Stages Stats

df.boxplot(column=['rating'], by=['dog_type'])


df.groupby('dog_type')['rating'].describe()

So puppers are getting lower ratings than the other dog types. Their median is 11, compared with 12 or 13 for the others, and they have several low outliers. This makes sense since 'pupper' can be used to describe irresponsible dogs.

Floofers are never rated below 10. I wonder if that is because they are always awesome or if it is based on time. We know that the ratings have been getting higher. If 'floof' is a newer term, only used in newer tweets, that might explain the consistently higher ratings.

df.reset_index(inplace=True)
df.groupby('dog_type')['timestamp'].describe()

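Here we see that 'floof' is not a new term. It was first used in January 2016, earlier than both 'doggo' (April 2016) and 'puppo' (June 2016). So timing does not explain the high ratings, and we can conclude that the floofers are consistently great dogs.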
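A slightly stricter version of this check is to compare the dog types only over the period in which all of the terms were already in use, so that no type benefits from appearing only in later, higher-rated tweets. This is a sketch, not something run above: `ratings_in_shared_window` is a hypothetical helper, it assumes `timestamp` is a regular column (as it is after `reset_index`), and the toy frame is made-up data.

```python
import pandas as pd

def ratings_in_shared_window(frame):
    """Median rating per dog_type, using only tweets posted on or after
    the date by which every dog_type has appeared at least once."""
    typed = frame.dropna(subset=['dog_type'])
    first_use = typed.groupby('dog_type')['timestamp'].min()
    window_start = first_use.max()  # first use of the latest-arriving term
    shared = typed[typed['timestamp'] >= window_start]
    return shared.groupby('dog_type')['rating'].median()

# Made-up toy data, used only to show how the helper is called.
toy = pd.DataFrame({
    'timestamp': pd.to_datetime(['2015-11-26', '2016-07-01', '2016-01-08',
                                 '2016-08-01', '2016-06-03']),
    'dog_type': ['pupper', 'pupper', 'floofer', 'floofer', 'puppo'],
    'rating': [8, 11, 12, 12, 13],
})
medians = ratings_in_shared_window(toy)
print(medians)
```

If floofers still sit at or above the other types within the shared window on the real data, that would strengthen the conclusion.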

Output of top_breeds.groupby('breed')['rating'].describe():

breed                      count       mean       std  min    25%    50%   75%   max
Cardigan                    21.0  11.142857  1.590148  7.0  10.00  11.00  12.0  13.0
Chesapeake_Bay_retriever    31.0  10.741935  1.510358  8.0  10.00  10.00  12.0  13.0
Chihuahua                   91.0  10.516484  2.071568  3.0   9.50  11.00  12.0  14.0
Eskimo_dog                  22.0  11.409091  1.402688  9.0  10.00  12.00  12.0  14.0
French_bulldog              31.0  11.193548  1.796652  8.0  10.00  12.00  12.0  14.0
German_shepherd             21.0  11.000000  1.449138  8.0  10.00  11.00  12.0  13.0
Labrador_retriever         108.0  11.180556  1.324567  7.0  10.00  11.00  12.0  14.0
Pembroke                    95.0  11.389474  1.746088  4.0  11.00  12.00  12.0  14.0
Pomeranian                  42.0  10.779762  1.619435  6.0  10.00  11.00  12.0  14.0
Samoyed                     42.0  11.690476  1.352290  7.0  11.00  12.00  13.0  14.0
Shih-Tzu                    20.0  10.350000  1.308877  8.0   9.75  10.50  11.0  13.0
Siberian_husky              20.0  11.025000  1.720427  5.5  10.00  11.00  12.0  13.0
Staffordshire_bullterrier   21.0  10.761905  1.374946  8.0  10.00  11.00  12.0  13.0
beagle                      20.0  10.150000  1.531253  6.0   9.75  10.00  11.0  13.0
chow                        48.0  11.416667  1.350072  7.0  11.00  12.00  12.0  13.0
cocker_spaniel              30.0  11.316667  1.177983  9.0  10.25  11.25  12.0  13.0
golden_retriever           157.0  11.624204  1.188427  8.0  11.00  12.00  12.0  14.0
malamute                    33.0  10.878788  1.218544  8.0  10.00  11.00  12.0  13.0
miniature_pinscher          25.0  10.000000  2.565801  2.0   9.00  11.00  12.0  12.0
pug                         62.0  10.241935  1.816910  3.0   9.25  10.00  11.0  13.0
toy_poodle                  51.0  11.039216  1.264291  7.0  10.00  11.00  12.0  13.0

Output of df.groupby('dog_type')['timestamp'].describe():

dog_type  count  unique  top                  freq  first                last
doggo        69      69  2016-11-18 23:35:32     1  2016-04-02 01:52:38  2017-07-26 15:59:51
floofer      34      34  2016-07-05 20:41:01     1  2016-01-08 03:50:03  2017-07-18 00:07:08
pupper      237     237  2016-01-30 02:41:58     1  2015-11-26 21:36:12  2017-07-15 23:25:31
puppo        29      29  2017-01-29 02:44:34     1  2016-06-03 01:07:16  2017-07-25 01:55:32

Output of df[['favorites', 'retweet_count', 'rating']].corr(method='pearson'):

               favorites  retweet_count    rating
favorites       1.000000       0.914929  0.023167
retweet_count   0.914929       1.000000  0.023733
rating          0.023167       0.023733  1.000000

Output of df.groupby('dog_type')['rating'].describe():

dog_type  count       mean       std   min   25%   50%   75%   max
doggo      69.0  11.797101  1.510548   8.0  11.0  12.0  13.0  14.0
floofer    34.0  11.705882  0.759961  10.0  11.0  12.0  12.0  13.0
pupper    237.0  10.616160  1.833623   3.0  10.0  11.0  12.0  14.0
puppo      29.0  12.172414  1.197288   9.0  12.0  13.0  13.0  14.0