Cities of the World Text Feature Extraction

Using Pandas and simple Python expressions, I created new text features from city names, such as the first letter of each name and the number of characters in it.

The dataset is available on Kaggle: https://www.kaggle.com/max-mind/world-cities-database

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

Import Data

In [2]:
df = pd.read_csv('worldcities.csv')
print(df.head())
  Country        AccentCity  Population
0      ad  Andorra la Vella       20430
1      ad           Canillo        3292
2      ad            Encamp       11224
3      ad        La Massana        7211
4      ad      Les Escaldes       15854

Countries represented in the dataset

In [3]:
print(df.Country.unique())
['ad' 'ae' 'af' 'ag' 'ai' 'al' 'am' 'an' 'ao' 'ar' 'at' 'au' 'aw' 'az'
 'ba' 'bb' 'bd' 'be' 'bf' 'bg' 'bh' 'bi' 'bj' 'bm' 'bo' 'br' 'bs' 'bt'
 'bw' 'by' 'bz' 'ca' 'cd' 'cf' 'cg' 'ch' 'ci' 'ck' 'cl' 'cm' 'cn' 'co'
 'cr' 'cu' 'cv' 'cy' 'cz' 'de' 'dj' 'dk' 'dm' 'do' 'dz' 'ec' 'ee' 'eg'
 'er' 'es' 'et' 'fi' 'fo' 'fr' 'ga' 'gb' 'gd' 'ge' 'gf' 'gh' 'gi' 'gl'
 'gm' 'gn' 'gp' 'gq' 'gr' 'gt' 'gw' 'gy' 'hn' 'hr' 'ht' 'hu' 'id' 'ie'
 'il' 'in' 'iq' 'ir' 'is' 'it' 'jm' 'jo' 'jp' 'ke' 'kg' 'kh' 'ki' 'km'
 'kn' 'kr' 'kw' 'ky' 'kz' 'la' 'lb' 'lc' 'li' 'lk' 'lr' 'ls' 'lt' 'lu'
 'lv' 'ly' 'ma' 'mc' 'md' 'me' 'mg' 'mk' 'ml' 'mm' 'mn' 'mq' 'mr' 'mt'
 'mu' 'mv' 'mw' 'mx' 'my' 'mz' 'na' 'nc' 'ne' 'ng' 'ni' 'nl' 'no' 'np'
 'nu' 'nz' 'om' 'pa' 'pe' 'pf' 'pg' 'ph' 'pk' 'pl' 'pm' 'pt' 'pw' 'py'
 'qa' 're' 'ro' 'rs' 'ru' 'rw' 'sa' 'sb' 'sc' 'sd' 'se' 'sg' 'sh' 'si'
 'sj' 'sk' 'sl' 'sm' 'sn' 'so' 'sr' 'sv' 'sy' 'sz' 'tc' 'td' 'tf' 'tg'
 'th' 'tj' 'tm' 'tn' 'to' 'tr' 'tt' 'tv' 'tw' 'tz' 'ua' 'ug' 'us' 'uy'
 'uz' 'vc' 've' 'vg' 'vn' 'vu' 'wf' 'ws' 'ye' 'yt' 'za' 'zm' 'zw']

Top 10 most common city names

In [4]:
city_freqdist = df.AccentCity.value_counts()
print(city_freqdist.head(10))
San Miguel       29
San Jose         28
San Antonio      27
San Vicente      25
San Francisco    24
San Isidro       22
San Juan         22
Santa Cruz       21
Oktyabrskiy      21
San Pedro        18
Name: AccentCity, dtype: int64

Create new features

In [5]:
# Number of characters in each city name
df['ACity_length'] = df.apply(lambda row: len(row.AccentCity), axis=1)
# First character of each city name
df['First_Letter'] = df.apply(lambda row: row.AccentCity[0], axis=1)
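
The same two features can also be built with the pandas `.str` accessor, which works on the whole column at once and tends to be faster than a row-wise `apply`. The sketch below is only an illustration: the `toy` frame is a made-up stand-in for the real data, and it includes a missing name to show that `.str` passes missing values through instead of raising an error.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; only the column name
# 'AccentCity' is taken from the notebook.
toy = pd.DataFrame(
    {"AccentCity": ["Andorra la Vella", "Canillo", "Encamp", None]}
)

# .str.len() counts the characters in each name.
toy["ACity_length"] = toy["AccentCity"].str.len()

# .str[0] takes the first character of each name.
toy["First_Letter"] = toy["AccentCity"].str[0]

print(toy)
```

If the real column contains missing names, the row-wise version would fail on them, so this form may be the safer choice there.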

Examine a single country

In [6]:
unitedstatesdf = df.loc[df['Country'] == 'us']
first_letter_freqdist = unitedstatesdf['First_Letter'].value_counts().sort_index()
ax = first_letter_freqdist.plot(kind='barh')
ax.invert_yaxis()
ax.set_title("Frequency of US Cities by Starting Letter")
Out[6]:
Text(0.5,1,'Frequency of US Cities by Starting Letter')

Lists for Continent grouping

In [7]:
Europe = ['ad', 'al', 'am', 'at', 'az', 'ba', 'be', 'bg', 'by', 'ch', 'cy', 'cz', 'de', 'dk', 'ee', 'es', 'fi', 'fr', 'gb', 'ge', 'gr', 'hr', 'hu', 'ie', 'is', 'it', 'li', 'lt', 'lu', 'lv', 'mc', 'md', 'me', 'mk', 'mt', 'nl', 'no', 'pl', 'pt', 'ro', 'rs', 'se', 'si', 'sk', 'sm', 'ua']
Asia = ['ae', 'af', 'bd', 'bh', 'bn', 'bt', 'cn', 'id', 'il', 'in', 'iq', 'ir', 'jo', 'jp', 'kg', 'kh', 'kp', 'kr', 'kw', 'kz', 'la', 'lb', 'lk', 'mm', 'mn', 'mv', 'my', 'np', 'om', 'ph', 'pk', 'qa', 'ru', 'sa', 'sg', 'sy', 'th', 'tj', 'tm', 'tr', 'tw', 'uz', 'vn', 'ye']
Namerica = ['ag', 'bb', 'bs', 'bz', 'ca', 'cr', 'cu', 'dm', 'do', 'gd', 'gt', 'hn', 'ht', 'jm', 'kn', 'lc', 'mx', 'ni', 'pa', 'sv', 'tt', 'us', 'vc']
Africa = ['ao', 'bf', 'bi', 'bj', 'bw', 'cf', 'cg', 'cd', 'ci', 'cm', 'cv', 'dj', 'dz', 'eg', 'er', 'et', 'ga', 'gh', 'gm', 'gn', 'gq', 'gw', 'ke', 'km', 'lr', 'ls', 'ly', 'ma', 'mg', 'ml', 'mr', 'mu', 'mw', 'mz', 'na', 'ne', 'ng', 'rw', 'sc', 'sd', 'sl', 'sn', 'so', 'st', 'sz', 'td', 'tg', 'tn', 'tz', 'ug', 'za', 'zm', 'zw']
Samerica = ['ar', 'bo', 'br', 'cl', 'co', 'ec', 'gy', 'pe', 'py', 'sr', 'uy', 've']
Oceania = ['au', 'fj', 'fm', 'ki', 'mh', 'nr', 'nz', 'pg', 'pw', 'sb', 'to', 'tv', 'vu', 'ws']
Unknown = ['ai', 'an', 'aw', 'bm', 'cc', 'ck', 'cx', 'eh', 'fk', 'fo', 'gf', 'gg', 'gi', 'gl', 'gp', 'gs', 'hk', 'mo', 'mp', 'mq', 'ms', 'nc', 'nf', 'nu', 'pf', 'pm', 'pn', 'ps', 're', 'sh', 'sj', 'tc', 'tf', 'tk', 'vg', 'vi', 'wf', 'yt', 'zr']
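
Hand-typed lists like these are easy to get subtly wrong, so it can help to turn them into a single lookup and check which country codes in the data are not covered. The sketch below is only an illustration: the `continents` dictionary holds shortened, made-up stand-ins for the full lists, and `toy`, `code_to_continent`, and `unassigned` are names invented here.

```python
import pandas as pd

# Shortened, hypothetical stand-ins for the full continent lists.
continents = {
    "Europe": ["ad", "fr"],
    "Oceania": ["au", "nz"],
}

# Invert into a flat lookup: country code -> continent name.
code_to_continent = {
    code: name for name, codes in continents.items() for code in codes
}

# Toy data with one code ('xx') that appears in none of the lists.
toy = pd.DataFrame({"Country": ["ad", "au", "nz", "xx"]})

# Series.map leaves codes that are missing from the lookup as NaN.
toy["Continent"] = toy["Country"].map(code_to_continent)

# Codes present in the data but missing from every list.
unassigned = sorted(set(toy["Country"]) - set(code_to_continent))
print(toy)
print(unassigned)
```

With a `Continent` column in place, a continent can be selected with a plain equality filter, and a non-empty `unassigned` list points at codes that were mistyped or left out.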

Examine a Continent

In [9]:
oceaniadf = df.loc[df['Country'].isin(Oceania)]
length_freqdist = oceaniadf['ACity_length'].value_counts().sort_index()
ax = length_freqdist.plot(kind='barh')
ax.invert_yaxis()
ax.set_title("Frequency of Oceania Cities by Name Length")
Out[9]:
Text(0.5,1,'Frequency of Oceania Cities by Name Length')