import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import datetime as dt
# Note: matplotlib 3.6 and later name this style 'seaborn-v0_8-whitegrid'.
matplotlib.style.use('seaborn-whitegrid')
df = pd.read_csv('twitter_archive_master.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
df[['favorites', 'retweet_count']].plot(style = '.', alpha = .2)
plt.title('Favorites and Retweets over Time')
plt.xlabel('Date')
plt.ylabel('Count')
Here you can see that both favorites and retweets rise gradually over time.
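A scatter plot with this many points can hide the trend, so one way to check the rise is to look at a monthly median. The following is a minimal sketch on synthetic data, not the real archive: the `demo` frame and its numbers are made up for illustration, and only the column names `favorites` and `retweet_count` come from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the archive (an assumption, not real data):
# one tweet per day, with counts that drift upward plus random noise.
rng = np.random.default_rng(0)
dates = pd.date_range('2016-01-01', periods=360, freq='D')
demo = pd.DataFrame(
    {
        'favorites': np.arange(360) * 20 + rng.integers(0, 500, size=360),
        'retweet_count': np.arange(360) * 8 + rng.integers(0, 200, size=360),
    },
    index=dates,
)

# Median per calendar month. The median is less sensitive to a few
# viral tweets than the mean would be.
monthly = demo.groupby(demo.index.to_period('M')).median()
print(monthly)
```

On the real data, the same `groupby` could be applied to `df` once `timestamp` is the index, and `monthly.plot()` would draw the trend as a line.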
df.plot(y='rating', ylim=[0, 14], style='.', alpha=.2)
plt.title('Rating Increase over Time')
plt.xlabel('Date')
plt.ylabel('Rating')
So Bront was right: the ratings keep getting higher. Whether the dogs are actually getting better, I'm not sure.
df[['favorites', 'retweet_count', 'rating']].corr(method='pearson')
I ran a Pearson correlation to see whether dogs with higher ratings were getting more favorites and retweets. My reasoning was that if the dogs are getting better, they should be getting more favorites and retweets along with the higher ratings. There is definitely a correlation between favorites and retweets, which makes me think that a tweet that is good in general gets more of both.
Yet there is no correlation between rating and either retweets or favorites. There are a few possible explanations. One is that the dogs are not actually getting better. Another is that the 'lower quality' dogs are given funnier captions, in which case it is the caption earning the retweets and favorites, not the dog itself.
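Pearson correlation is sensitive to extreme values, and the ratings include joke values such as 420 and 1776, so it may be worth checking whether those rows are masking a relationship. Below is a small sketch on made-up rows (the `demo` frame is an assumption for illustration, not archive data) that compares Pearson on all rows, Spearman rank correlation on all rows, and Pearson after dropping ratings above 14.

```python
import pandas as pd

# Made-up rows for illustration only; the last row mimics a joke rating.
demo = pd.DataFrame({
    'rating':        [8, 9, 10, 11, 12, 13, 1776],
    'favorites':     [900, 1500, 2100, 3000, 4200, 6000, 2500],
    'retweet_count': [300, 500, 700, 1000, 1400, 2000, 800],
})
cols = ['favorites', 'retweet_count', 'rating']

# Pearson on every row: the single extreme rating dominates the result.
pearson_all = demo[cols].corr(method='pearson')

# Spearman works on ranks, so the size of the outlier matters much less.
spearman_all = demo[cols].corr(method='spearman')

# Pearson again, after excluding the joke rating.
pearson_trimmed = demo.loc[demo['rating'] <= 14, cols].corr(method='pearson')

print(pearson_all.loc['rating', 'favorites'])
print(spearman_all.loc['rating', 'favorites'])
print(pearson_trimmed.loc['rating', 'favorites'])
```

In this toy example the relationship only shows up once the outlier is handled. Whether the same happens with the real archive is something the notebook would have to test; the sketch does not establish it.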
df['gender'].value_counts().plot(kind='pie')
plt.title('Dog Genders')
About three times as many boy dogs as girl dogs were identified. Based on what I know about biology, this is probably not representative of the dog population at large. More likely, male dogs have their sex announced more often, by their owner, by @dog_rates, or both.
I also remember one instance of the word 'girl' that referred to a human girl who was also in the picture, so it is possible that some of the dogs were misgendered by my function. It's also possible that @dog_rates is misgendering some females by using male pronouns as gender-neutral, which is somewhat common. However, since only a small portion of the rated dogs were gendered at all, it is hard to draw concrete conclusions.
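One way to reduce this kind of misgendering is to leave a tweet unlabeled when its text contains both male and female cues. The sketch below is hypothetical: the word lists and the name `infer_gender` are assumptions for illustration and are not the function used earlier in this project.

```python
import re

# Hypothetical cue words. \b keeps 'he' from matching inside 'the' or 'there'.
MALE = re.compile(r"\b(he|him|his|boy)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(she|her|hers|girl)\b", re.IGNORECASE)


def infer_gender(text):
    """Return 'male', 'female', or None when cues are missing or mixed."""
    has_male = bool(MALE.search(text))
    has_female = bool(FEMALE.search(text))
    if has_male == has_female:
        # Either no cue at all or conflicting cues: do not guess.
        return None
    return 'male' if has_male else 'female'


print(infer_gender("This is Max. He is a good boy."))
print(infer_gender("He brought his girl to the park."))
```

This does not solve the gender-neutral 'he' problem, and a tweet that mentions only a human girl would still be labeled female. It only avoids guessing when the cues conflict.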
# Keep only breeds that appear in at least 20 tweets.
top_breeds = df.groupby('breed').filter(lambda x: len(x) >= 20)
top_breeds['breed'].value_counts().plot(kind='barh')
plt.title('Most Frequently Rated Breeds')
plt.xlabel('Count')
plt.ylabel('Breed')
It's difficult to know why these breeds come out on top. They could simply be the most commonly owned. They could be favorites of whoever runs @dog_rates, and so get chosen for rating most often. Or they could be the breeds that the AI used to identify them recognizes most easily. Anecdotally, I have only ever owned the top two breeds.
top_breeds.groupby('breed')['rating'].describe()
df['rating'].describe()
Here we have a statistical description of the top breeds next to a statistical description of all the ratings. Only one of the top breeds has a mean higher than the overall mean. That might be because the joke ratings of 420 and 1776 are pulling the overall mean up.
df[df['rating'] <= 14]['rating'].describe()
So I excluded the joke ratings, and now the overall mean is 10.555. Only five of the top 21 breeds have means under that average, so most of these breeds are rated higher than average.
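Rather than reading the two `describe()` tables side by side, the comparison can be made directly. The sketch below uses a tiny made-up frame (the `demo` rows are assumptions, not archive data) and applies the same `rating <= 14` filter to both the breed means and the overall mean, so both sides use the same rows.

```python
import pandas as pd

# Made-up rows for illustration only; 420 stands in for a joke rating.
demo = pd.DataFrame({
    'breed':  ['golden_retriever'] * 3 + ['pug'] * 3 + ['unknown'] * 2,
    'rating': [12, 13, 11, 9, 10, 8, 420, 10],
})

# Drop joke ratings first so both sides of the comparison use the same rows.
clean = demo[demo['rating'] <= 14]
overall_mean = clean['rating'].mean()

# Mean rating per breed, then keep the breeds above the overall mean.
breed_means = clean.groupby('breed')['rating'].mean()
above_average = breed_means[breed_means > overall_mean].sort_values(ascending=False)

print(round(overall_mean, 3))
print(above_average)
```

With the real data, `demo` would be replaced by `top_breeds` for the breed means and by `df` for the overall mean.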
df.boxplot(column=['rating'], by=['dog_type'])
df.groupby('dog_type')['rating'].describe()
So puppers are getting much lower ratings than the other dog types. Their median is lower and they have several low outliers. This makes sense, since 'pupper' can be used to describe irresponsible dogs.
Floofers are consistently rated above 10. I wonder whether that is because they are always awesome or whether it is an effect of time. We know that the ratings have been getting higher. If 'floof' is a newer term, used only in newer tweets, that might explain the consistently higher ratings.
df.reset_index(inplace=True)
df.groupby('dog_type')['timestamp'].describe()
Here we see that 'floof' is not a new term: it was first used in January 2016. So the higher ratings are not simply a result of the term appearing only in recent tweets, and we can conclude that the floofers are consistently great dogs.
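The first-use date shows the term is not new, but it does not show how the floofer tweets are spread over time. A more direct check is to compare each rating with the median rating of all tweets from the same month. This is a minimal sketch on made-up rows (the `demo` frame and the helper column names `month_median` and `rating_vs_month` are assumptions for illustration).

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2016-01-05', '2016-01-20', '2016-01-25',
        '2017-01-03', '2017-01-15', '2017-01-28',
    ]),
    'dog_type': ['floofer', 'pupper', 'pupper', 'floofer', 'pupper', 'doggo'],
    'rating': [11, 8, 10, 13, 12, 13],
})

# Median rating of all tweets posted in the same calendar month.
month = demo['timestamp'].dt.to_period('M')
demo['month_median'] = demo.groupby(month)['rating'].transform('median')

# Positive values mean the dog was rated above its month's typical rating.
demo['rating_vs_month'] = demo['rating'] - demo['month_median']
relative = demo.groupby('dog_type')['rating_vs_month'].mean()
print(relative)
```

If floofers still come out above zero on the real data, the general upward drift in ratings is unlikely to be the whole explanation.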