Pokemon Data Analysis — Knowing Pokemon using Data

7 min read · Feb 10, 2022

Pokemon is a legendary game that many of us played a long time ago. We all know the fun and stress of battling Gym Leaders, finding the many Pokemon we loved, or just wandering the map to uncover all its secrets. I played my first Pokemon game when I was 10 years old. At that time, I didn’t know much about Pokemon and just used my Charizard to kill them all.

But now, I’m interested in exploring the Pokemon world in depth using Exploratory Data Analysis to find out which Pokemon can actually help me beat my Pokemon games.

First of all, the data we are going to use comes from this dataset. It contains all Pokemon from Generation 1 to Generation 6, along with details such as Attack, Defense, HP, Sp. Attack, Sp. Defense, Speed, and Pokedex number.

Now, let’s go through the analysis to find out which Pokemon we should catch to win ’em all!

Gear Up

In this analysis, we will use Python to analyze and visualize our data. The libraries we will use are Matplotlib, Seaborn, and Pandas. If you are new to these libraries, have a look at their documentation.

To use these libraries we must install them first.

#Installing pandas
pip install pandas
#Installing Matplotlib
pip install matplotlib
#Installing Seaborn
pip install seaborn

Importing Dataset

Now we need to import the dataset. You can find the dataset here.

After you download the dataset, import it using pandas read_csv.

#Importing dataset
import pandas as pd

df = pd.read_csv('pokemon.csv')

The dataset will look like this

Make sure our data isn’t stinky

After importing the dataset, we need to clean it. Some names clearly need fixing: VenusaurMega Venusaur should be Mega Venusaur, HoopaHoopa Confined should be Hoopa Confined, and so on. To clean them, we use the pandas str.replace function with a regular expression that removes everything in front of the keyword.

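#Removing extra Mega (regex=True is required in recent pandas versions)
df['name'] = df['name'].str.replace(".*(?=Mega)","",regex=True)
#Removing extra Hoopa
df['name'] = df['name'].str.replace(".*(?=Hoopa)","",regex=True)
#Removing extra Primal
df['name'] = df['name'].str.replace(".*(?=Primal)","",regex=True)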
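To see what the lookahead pattern actually does, here is a small sketch using Python's built-in re module instead of pandas. The helper clean_name is hypothetical (it is not part of the original code), and Meganium is an extra sample name I added to check that ordinary names are left alone.

```python
import re

def clean_name(name):
    # Hypothetical helper: delete everything in front of the last
    # "Mega", "Hoopa" or "Primal" in the name. (?=...) is a lookahead,
    # so the keyword itself is kept.
    for keyword in ("Mega", "Hoopa", "Primal"):
        name = re.sub(".*(?=" + keyword + ")", "", name)
    return name

samples = ["VenusaurMega Venusaur", "HoopaHoopa Confined", "Bulbasaur", "Meganium"]
for raw in samples:
    print(raw, "->", clean_name(raw))
```

Names that already start with the keyword, or do not contain it at all, come back unchanged.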

Then, we need to make the Pokedex number our index. In the dataset, the # column holds each Pokemon’s Pokedex number, so we rename that column to pokedex and set it as the index.

#Renaming # into pokedex
df = df.rename(columns={'#' : 'pokedex'})
#Changing index into pokedex
df = df.set_index('pokedex')

After the cleaning, the data should look like this.

Then, we need to check if there is an NA value in the data.

#Check if there is NA
df.isna().sum()

The data looks good. The 386 missing values in type_2 simply indicate single-type Pokemon.

Everything is set, now let’s move to the fun part!

Analyze ’em All !

First, we need to add a column named overall_stats, since we will need it later. The overall stat is the average of all six stats of a Pokemon.

df['overall_stats']= df[['hp','attack','defense','sp_atk','sp_def','speed']].mean(axis=1).round(2)
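As a quick sanity check of what this row-wise mean produces, here is a minimal sketch in plain Python. The base stats below are typed in by hand for illustration (in the order hp, attack, defense, sp_atk, sp_def, speed), so double-check them against the dataset. The function overall_stat is a hypothetical stand-in for the pandas expression above.

```python
# Base stats typed in by hand for illustration, in the order
# hp, attack, defense, sp_atk, sp_def, speed.
base_stats = {
    "Charizard": [78, 84, 78, 109, 85, 100],
    "Slaking": [150, 160, 100, 95, 65, 100],
}

def overall_stat(stats):
    # Same idea as .mean(axis=1).round(2): average the six stats of one row.
    return round(sum(stats) / len(stats), 2)

for name, stats in base_stats.items():
    print(name, overall_stat(stats))
```

With these numbers, Charizard averages 89.0 and Slaking averages 111.67.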

Now let’s do a visualization on Pokemon generation and their overall stats

import matplotlib.pyplot as plt
import seaborn as sns

fig, a = plt.subplots(1,2,figsize=(15,5))

#Pie chart: share of Pokemon per generation
#Labels are built from the counts so they always match the slice order
counts = df.generation.value_counts()
labels = ['Gen ' + str(g) for g in counts.index]
pie = counts.plot(ax=a[0],kind='pie',autopct='%1.1f%%',labels=labels)
pie.set(ylabel=None)
pie.set_title('Percentage of Pokemon in Every Generation')

#Bar chart: average overall stats per generation
stats = sns.barplot(ax=a[1],data=df,x='generation',y='overall_stats')
stats.set(ylabel=None, xlabel='Generation')
stats.set_title('Pokemon Average Overall Stats Every Generation')

box = sns.boxplot(x=df.generation,y=df.overall_stats)
box.set_title('Overall stats of Pokemon in Every Generation - All Pokemon',fontsize=28)
box.set(ylabel=None,xlabel='Generation')

Voila, now we can easily see all the information about generations and overall stats. Most Pokemon come from Gen 1, Gen 3, and Gen 5, with about 20% each. The average overall stat, however, is fairly balanced across generations. Still, the third generation stands out with the highest maximum overall stat.

Now let’s do some analysis with Pokemon type and their overall stat.

fig, a = plt.subplots(1,2,figsize=(20,5))

#Only label slices larger than 5% to keep the pie readable
def autopct(pct):
    return ('%.1f' % pct) if pct > 5 else ''

pie = df.type_1.value_counts().plot(ax=a[0],kind='pie',autopct=autopct)
pie.set_title('Total Pokemon of Each Type')
pie.set(ylabel=None)

#Bar chart: average overall stats per type
stats = sns.barplot(ax=a[1],data=df,x='type_1',y='overall_stats')
stats.set(ylabel=None, xlabel='Type')
stats.tick_params(axis='x',rotation=45)
stats.set_title('Pokemon Average Overall Stats every Type',fontsize=28)

box = sns.boxplot(y=df.overall_stats,x=df.type_1)
box.set_title('Overall stats of Type - All Pokemon',fontsize=28)
box.set(ylabel=None, xlabel='Type')

In the charts above, we can see the Pokemon types and their overall stats in detail. The most common type is Water at 14%, followed by Normal at 12.3%. The boxplot shows the distribution of overall stats per type: Dragon-type stands out the most, although Psychic and Ground types reach the same maximum overall stat as Dragon. However, this covers all Pokemon. We all know that Legendary and Mega Pokemon hold the best stats in the Pokemon world, but we should plot it first to confirm that.

#Top 10 Pokemon by overall stats
top10 = df.nlargest(10,'overall_stats')
atk = sns.barplot(data=top10,x='name',y='overall_stats',hue='type_1',dodge=False)
atk.tick_params(axis='x',rotation=20)
atk.set_title('Pokemon Highest Overall Stats for All Generation',fontsize=28)
atk.set(ylabel=None, xlabel=None, ylim=(10,180))
plt.legend(title='Type')

As we can see in this chart, the highest overall stats are dominated by Mega and Legendary Pokemon. Realistically, we can’t catch every Legendary or Mega Pokemon to help us. So, let’s make charts containing only non-Legendary/Mega Pokemon.

#Keep only Pokemon that are neither Legendary nor Mega
#'Mega ' (with a trailing space) avoids dropping Meganium by accident
regular = df[(~df.legendary) & (~df.name.str.contains('Mega '))]
box = sns.boxplot(data=regular,x='type_1',y='overall_stats')
box.set_title('Overall stats of type - Non Legendary/Mega Pokemon',fontsize=28)
box.set(ylabel=None,xlabel='Type')

#Top 10 non-Legendary/Mega Pokemon by overall stats
#'Mega ' (with a trailing space) avoids dropping Meganium by accident
regular = df[(~df['legendary']) & (~df['name'].str.contains('Mega '))]
top10 = regular.nlargest(10,'overall_stats')
atk = sns.barplot(data=top10,x='name',y='overall_stats',hue='type_1',dodge=False)
atk.tick_params(axis='x',rotation=20)
atk.set_title('Non Legendary/Mega Pokemon Highest Overall Stats for All Generation',fontsize=28)
atk.set(ylabel=None, xlabel=None, ylim=(10,150))
plt.legend(title='Type')

Now let’s look at the visualization for non-Legendary/Mega Pokemon. In this category, Dragon-type is again the best type, though the differences between types are fairly small. In the boxplot, we can also spot that Normal-type has the highest maximum overall stat among all types. Slaking is the ‘King’ of non-Legendary/Mega Pokemon with the highest overall stat, and since Slaking is a Normal-type, it is clear why Normal has the highest maximum in the boxplot. We can also see that every other top Pokemon besides Slaking has the same overall stat, which is 100.
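One detail worth checking when filtering Mega forms by name: a bare substring match on Mega also catches Meganium, which is not a Mega Evolution. The sketch below uses a hand-picked list of names (my own sample, not pulled from the dataset) to compare a loose check with a stricter one that requires a space after Mega.

```python
# Hand-picked sample names for illustration only.
names = ["Mega Venusaur", "Meganium", "Slaking", "Mega Charizard X"]

# Loose check: any name containing "Mega", which also flags Meganium.
loose = [n for n in names if "Mega" in n]

# Stricter check: require "Mega " including the trailing space.
strict = [n for n in names if "Mega " in n]

print("loose :", loose)
print("strict:", strict)
```

Meganium is not among the highest overall stats, so the top-10 chart is unaffected either way, but the boxplot for its type can shift slightly.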

Now let’s move on to our favorite stat: Attack.

There is no doubt that almost everyone playing Pokemon looks for a high attack stat. High attack means we can get rid of anyone who gets in our way! Now, let’s do a visualization to find out more about our lovely attack stat.

#Top 10 Pokemon by attack
top_atk = df.nlargest(10,'attack')
atk = sns.barplot(data=top_atk,x='name',y='attack',hue='type_1',dodge=False)
atk.tick_params(axis='x',rotation=20)
atk.set_title('Pokemon Highest Attack Status for All Generation',fontsize=28)
atk.set(ylabel=None, xlabel=None, ylim=(10,250))
plt.legend(title='Type')

#Top 10 non-Legendary/Mega Pokemon by attack
#'Mega ' (with a trailing space) avoids dropping Meganium by accident
regular = df[(~df['legendary']) & (~df['name'].str.contains('Mega '))]
top_atk = regular.nlargest(10,'attack')
atk = sns.barplot(data=top_atk,x='name',y='attack',hue='type_1',dodge=False)
atk.tick_params(axis='x',rotation=20)
atk.set_title('Non Legendary/Mega Pokemon Highest Attack Status for All Generation',fontsize=28)
atk.set(ylabel=None, xlabel=None, ylim=(10,250))
plt.legend(title='Type')

As always, Legendary and Mega Pokemon are the best no matter what. So, let’s dig into the non-Legendary/Mega chart. Rampardos holds the best attack stat but does not appear among our best overall stats. This means Rampardos only shines in attack, while other stats such as defense or speed are weaker. If we look more closely, we can also see Slaking among the highest attack stats.
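To make the Rampardos point concrete, here is a small sketch that ranks three Pokemon once by attack and once by total stats. The base stats are typed in by hand for illustration (hp, attack, defense, sp_atk, sp_def, speed) and Dragonite is my own pick for comparison, so verify the numbers against the dataset before relying on them.

```python
# Base stats typed in by hand for illustration, in the order
# hp, attack, defense, sp_atk, sp_def, speed.
pokemon = {
    "Rampardos": [97, 165, 60, 65, 50, 58],
    "Slaking": [150, 160, 100, 95, 65, 100],
    "Dragonite": [91, 134, 95, 100, 100, 80],
}

# Rank by the attack stat alone (index 1 in each list).
by_attack = sorted(pokemon, key=lambda n: pokemon[n][1], reverse=True)

# Rank by the sum of all six stats, which orders them like the average does.
by_overall = sorted(pokemon, key=lambda n: sum(pokemon[n]), reverse=True)

print("by attack :", by_attack)
print("by overall:", by_overall)
```

Rampardos comes first by attack but last by overall, because its other five stats pull the average down.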

Conclusion

There are so many Pokemon in the Pokemon world, each with many different stats. We analyzed the overall stat and the attack stat to find out which Pokemon can help us most.

Legendary and Mega Pokemon are still the number one recommendation, since they have high overall and attack stats. But if you want a strong non-Legendary/Mega Pokemon, you can try a Dragon-type Pokemon or Slaking. Appearing in both the best overall stats and the best attack stats, Slaking is sure to make your Pokemon journey easier.

However, this analysis is based only on overall stats and attack stats. A Pokemon battle is determined by many other factors, so this recommendation will not be 100% accurate, but it should still help you on your journey through the Pokemon world.

Hey, that’s the end of our journey. If you are interested in the visualizations used here, feel free to see the code here!

Thanks for reading, and feel free to comment on anything. See you!