ZZEN9021 Principles of Programming
Instructions
The aim of this task is for you to demonstrate what you have learned in the course, by analysing and visualising some data of your own choosing, and presenting it in a Jupyter notebook.
What you need to do
1. Find some data that interests you. It could be data from work, or data you have found online (perhaps at Kaggle, or at a government website). Make sure the data is spread across two or three files, so that you can demonstrate your ability to deal with multiple-file datasets. And make sure it is complicated enough for you to demonstrate your manipulation skills.
2. Load the data into data frames (a minimal sketch of this workflow is shown after this list).
3. Inspect the data, to get a feel for it - what the columns are, what ranges of values they contain, how they are related, and so on.
4. Pose some interesting questions about the data - ones that will require enough manipulation for you to demonstrate your skills, and ones that you can also display as plots. Around six (6) complex questions will be fine; you may pose more questions if they are less complex.
5. Clean the data as required.
6. Manipulate the data as required, to get answers to your questions.
7. Illustrate your answers using tables and plots if applicable. Ed will only allow you to upload a total of 20MB for this task, so keep this in mind when you are choosing your data.
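As a rough illustration of steps 2-7, a minimal sketch is given below. The file names, column names and join key used here (orders.csv, customers.csv, customer_id, region, order_value) are purely hypothetical placeholders, not part of any prescribed dataset; substitute your own files and columns.
import pandas as pd
# Hypothetical files - replace with your own multi-file dataset
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
# Inspect the data
print(orders.shape, customers.shape)
print(orders.head())
# Combine the files on a shared key and aggregate to answer a question
combined = orders.merge(customers, on="customer_id", how="left")
summary = combined.groupby("region")["order_value"].sum()
# Illustrate the answer with a plot
summary.plot(kind="bar", title="Total order value by region")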
Presentation
You should present your work in a Jupyter notebook. Explain all of your steps clearly, and show your code and its output. In addition, make sure you do the following:
Explain where you got your data from.
Explain why you find your questions interesting.
Draw clear answers to your questions.
Include a table of contents.
Include in-text citations and full references, using either APA or Harvard style.
You will be given a mark out of 25. Marks will be awarded as follows:
Choice of data (5 marks). Full marks for choosing data that is interesting and sufficiently complex.
Inspection and cleaning of data (5 marks). Full marks for getting a good understanding of what's in the data, and cleaning it appropriately.
Manipulation of data (5 marks). Full marks for demonstrating many of the data manipulation skills taught in the course.
Visualisation of data (5 marks). Full marks for demonstrating many of the data visualisation skills taught in the course.
Clarity of presentation (5 marks). Full marks for presenting your work clearly and logically including in-text citations and full references.
Sample questions
Please find below some sample questions related to the Bird Strikes dataset from Kaggle:
1. How many different species (distinct) have been involved in bird strikes?
2. Where the data has been recorded, what percentage of the time was the pilot warned of birds or wildlife?
3. What are the top five airline/operators involved in bird strikes?
4. What proportion (percentage) of bird strikes involved medium and large wildlife?
5. In which phase of flight would the aircraft be most likely to experience a bird strike?
6. Has a bird strike ever occurred whilst the aircraft was parked? If so, in how many instances?
7. Is a bird strike more likely to happen when there is no cloud as compared with overcast conditions?
8. In what precipitation conditions is a bird strike most likely to happen?
9. Which species of wildlife had the largest total cost? How much was this?
10. For the most costly species of wildlife, which phase of flight was the most common for a bird strike?
11. For the species Bald Eagle and Canadian Goose, at which times of day do bird strikes occur, and are there any differences between the two species?
The above questions are just samples to draw inspiration from.
A sample project from Larissa Simpson (a previous student) is also provided below for inspiration.
World Food Production
This project will analyse world food production data (by total weight per country) between 1961 and 2013. It will look specifically at the types of food produced and relationships with country population data, and will also investigate whether there is a relationship between the amount of food produced and levels of obesity.
Summary of contents:
1. Introduction
2. Summary
3. Data Inspection and Cleaning
4. Analysis and Answers to Questions
5. Conclusions and Discussion
6. Future Analysis
7. References
1. Introduction
A 2018 study published in Elementa by the University of California Press (Berners-Lee et al., 2018) discusses the complexity and importance of achieving sustainability of the world's food supply into the future. Currently, world hunger is increasing, with estimates in 2018 putting the number of people suffering from hunger at 821 million (United Nations, 2018). While the amount of food currently produced would be sufficient to feed the projected world population in 2050, significant changes to our current production model and food consumption would be needed, including less consumption of dairy and meat and better access to foods across the world (Berners-Lee et al., 2018; United Nations, 2018). Further, obesity levels continue to rise in many countries, and ensuring equitable distribution of food, without excess, and access to a balanced and nutritious diet will be vital heading into the future.
The analysis completed as part of this project will help visualise the current food production environment and problems, and demonstrate the need for change in the future.
More personally, between July 2017 and March 2020, I lived and worked in Tokyo, Japan. One thing that I really noticed living there was the difference in food production between Japan and Australia. In Australia, I was used to the majority of fresh foods and meat being produced in Australia and didn't give this much thought as a consumer. However, in a Japanese supermarket, products were covered with labels indicating country of origin; beef from Australia, pork from the USA, apples from Australia and avocados from Mexico. I often found that the quality of these products was lower than the local produce I was used to from Australia. Because of this, Japanese people place a lot of importance on their agriculture industry, especially the production of rice which has long been a part of their culture, having been cultivated for more than 3,000 years (Hays 2020). This experience is one of the reasons why I personally find this data set interesting.
While it is disappointing that the data in this data set is not more recent (it only runs until 2013), it still provides interesting insights into global food production. The topic will be investigated by analysing the data to answer the questions below.
Questions to be answered:
In 2013, what was the most produced type of food in the world by weight?
In 2013, what was the most produced type of food in Australia and for Australia's top 20 foods, how do the proportions of these compare with the world?
In 2013, which countries produced the most food per 10,000 people?
Continuing with 2013, how does the top food product produced in each country compare to how far it is located from the equator?
Is there a relationship between food production per 10,000 people and a country's obesity %?
Which countries produced the most total food in 2013, and how has their production per 10,000 people changed over time?
What proportion of the world's food by weight is produced for animal feed vs. human food and how has total production changed over time?
Data Files
This notebook uses the four data sources below. The key that links the files is Country Code, or in the case of the Obesity file, Country Name and Year. The files will be inspected in detail in section 3 below, however a brief summary is also provided here.
File 1 - Food Data: obtained from Kaggle (Oppenheim, 2017). The Kaggle author modified data available from the Food and Agriculture Organization of the United Nations. Main data columns include:
Item (114 different food types)
Element (Feed for animals or Food for humans)
Latitude
Longitude
Year columns (1961-2013), whose values are the total weight of food produced in each category in units of 1,000 tonnes.
File 2 - Population Data: obtained from Kaggle (Chadha, 2018). Main data columns include Country, Indicator Name, Indicator Code and year columns (1960-2016) containing each country's population.
File 3 - Obesity Data: obtained from Kaggle (Arora, 2020). Main data columns include Country, Year, Obesity (%) and Sex.
File 4 - World Map: used in Q3 and available as part of the Geopandas library (2013-2019) rather than as a separate file. A world map GeoDataFrame that allows visualisation of data based on country.
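As a purely illustrative sketch of how these keys link the files, the joins look roughly as follows. The rows below are hypothetical toy values, not taken from the datasets; the actual merges, after reshaping the year columns into a long format, are performed in section 3.
import pandas as pd
# Hypothetical toy rows to illustrate the join keys only
food = pd.DataFrame({'Country Code': ['AUS', 'JPN'], 'Year': ['2013', '2013'],
                     'Total Food Production': [50000, 60000]})
population = pd.DataFrame({'Country Code': ['AUS', 'JPN'], 'Year': ['2013', '2013'],
                           'Country': ['Australia', 'Japan'],
                           'Population': [23000000, 127000000]})
obesity = pd.DataFrame({'Country': ['Australia', 'Japan'], 'Year': ['2013', '2013'],
                        'Obesity (%)': [28.0, 4.0]})
# The food and population files link on Country Code (and Year)
merged = food.merge(population, on=['Country Code', 'Year'], how='inner')
# The obesity file has no country code, so it links on Country Name and Year
merged = merged.merge(obesity, on=['Country', 'Year'], how='left')
print(merged)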
2. Summary
Overall, the relationship between the weight of food produced and the other factors examined turned out to be quite complex, with a number of future avenues of research identified. While a full discussion of results is provided in the Conclusions and Discussion section, it was found that there are many regional variations in total food production and in the types of food produced. It was concluded that, in order to meet the demands of the future, changes in the types of foods consumed and the continued and expanded global trade of food will be vital (Berners-Lee et al., 2018).
3. Data Inspection & Cleaning
# Start by importing required modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import log
from numpy import where
import geopandas
import geoplot
%matplotlib inline
# Import the required data
# A change in encoding type was needed due to an error in data loading for food file.
food_df = pd.read_csv("food_data.csv", encoding='latin-1')
population_df = pd.read_csv("world_pop.csv", index_col=1)
obesity_df = pd.read_csv("obesity.csv", index_col=0)
# Have a quick look at the food dataframe
print(food_df.columns.is_unique)
print(food_df.shape)
print(food_df[food_df.duplicated() == True])
food_df.head()
True
(21477, 63)
Empty DataFrame
Columns: [Area Abbreviation, Area Code, Area, Item Code, Item, Element Code, Element, Unit, latitude, longitude, Y1961, Y1962, Y1963, Y1964, Y1965, Y1966, Y1967, Y1968, Y1969, Y1970, Y1971, Y1972, Y1973, Y1974, Y1975, Y1976, Y1977, Y1978, Y1979, Y1980, Y1981, Y1982, Y1983, Y1984, Y1985, Y1986, Y1987, Y1988, Y1989, Y1990, Y1991, Y1992, Y1993, Y1994, Y1995, Y1996, Y1997, Y1998, Y1999, Y2000, Y2001, Y2002, Y2003, Y2004, Y2005, Y2006, Y2007, Y2008, Y2009, Y2010, Y2011, Y2012, Y2013]
Index: []
[0 rows x 63 columns]
Overall, this data looks pretty good. The columns are unique, there are no duplicated rows and we know the size of the dataframe. There is, however, some cleaning required to make the data clearer and more usable.
The following issues were discovered whilst working with the data, mainly through inconsistent results in graphical representations. They have been cleaned and corrected.
New Countries Correction
One thing to note about the food data is that countries that no longer existed in 2013 have been completely removed. Conversely, countries that formed between 1961 and 2013 have been included. This becomes a problem in Q7, where, for example, the collapse of the Soviet Union is apparent in the time graph.
The solution to this is to delete the new countries from the data as this has only a minor impact. However, as this is only required for Q7 and we want to keep all the countries for the other questions, this data cleaning will be performed as part of Q7.
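For reference, a minimal sketch of how that exclusion might be applied when we get to Q7 is shown below. The country codes listed are illustrative examples only (some post-Soviet and post-Czechoslovak states), not the definitive list used in the analysis, and food_df is the dataframe loaded above.
# Sketch only: exclude countries that did not exist for the whole 1961-2013 period
# The codes below are illustrative examples, not the definitive list
new_country_codes = ['RUS', 'UKR', 'KAZ', 'CZE', 'SVK']
q7_food_df = food_df[~food_df['Country Code'].isin(new_country_codes)]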
Other Corrections
These will be completed now and include:
The vegetables food type was split across two categories, "Vegetables" and "Vegetables, Other". These have been merged into one.
A couple of errors in country codes were discovered which affected merging of the separate files. These are corrected here in the data cleaning section.
Filling missing values with 0 (where appropriate) and removing unneeded columns.
Taiwan is missing from population data. Initially I thought this was due to it being included with China, but on inspection this did not seem to be the case. Taiwan is therefore excluded from the results that utilise the population file.
Obesity data had a value of 'no data' mixed in with numerical strings. This became evident when trying to convert the percentages to floats and is now corrected as part of the initial data cleaning.
Let's now clean up the data a bit to include only the columns needed, and rename the columns for more consistency.
File 1- Food Data
# Remove columns that are not required
food_df = food_df.drop(
    ['Area Code', 'Item Code', 'Unit', 'Element Code'],
    axis=1
)
# Rename the remaining columns
food_df.rename(columns=lambda x: x[-4:] if x[0] == 'Y' else x, inplace=True)
food_df = food_df.rename(
    columns={
        'Area Abbreviation': 'Country Code', 'Item': 'Food', 'Element': 'Type',
        'latitude': 'Latitude', 'longitude': 'Longitude', 'Area': 'Country'
    }
)
# Replace 'Vegetables, Other' with 'Vegetables' as the category is currently split
food_df['Food'] = food_df['Food'].replace(['Vegetables, Other'], ['Vegetables'])
# Clean up incorrect country codes (these were found in the process of completing the analysis)
food_df['Country Code'] = food_df['Country Code'].replace(['KOR', 'QAT'], ['PRK', 'KOR'])
food_df['Country Code'] = where(food_df['Country'] == 'Bahamas', 'BHS', food_df['Country Code'])
# Fix country codes where China, Taiwan, Hong Kong and Macao all have the same code
food_df['Country Code'] = where(
    food_df['Country'] == 'China, Taiwan Province of', 'TWN', food_df['Country Code']
)
food_df['Country Code'] = where(
    food_df['Country'] == 'China, Hong Kong SAR', 'HKG', food_df['Country Code']
)
food_df['Country Code'] = where(
    food_df['Country'] == 'China, Macao SAR', 'MAC', food_df['Country Code']
)
print(food_df.describe())
food_df.info()
Latitude Longitude 1961 1962 \
count 21477.000000 21477.000000 17938.000000 17938.000000
mean 20.450613 15.794445 195.262069 200.782250
std 24.628336 66.012104 1864.124336 1884.265591
min -40.900000 -172.100000 0.000000 0.000000
25% 6.430000 -11.780000 0.000000 0.000000
50% 20.590000 19.150000 1.000000 1.000000
75% 41.150000 46.870000 21.000000 22.000000
max 64.960000 179.410000 112227.000000 109130.000000
1963 1964 1965 1966 \
count 17938.000000 17938.000000 17938.000000 17938.000000
mean 205.464600 209.925577 217.556751 225.988962
std 1861.174739 1862.000116 2014.934333 2100.228354
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 1.000000 1.000000 1.000000 1.000000
75% 23.000000 24.000000 25.000000 26.000000
max 106356.000000 104234.000000 119378.000000 118495.000000
1967 1968 ... 2004 2005 \
count 17938.000000 17938.000000 ... 21128.000000 21128.000000
mean 230.419222 238.415487 ... 486.690742 493.153256
std 2132.244717 2189.166883 ... 5001.782008 5100.057036
min 0.000000 0.000000 ... 0.000000 0.000000
25% 0.000000 0.000000 ... 0.000000 0.000000
50% 1.000000 2.000000 ... 6.000000 6.000000
75% 27.000000 28.000000 ... 75.000000 77.000000
max 118725.000000 127512.000000 ... 360767.000000 373694.000000
2006 2007 2008 2009 \
count 21373.000000 21373.000000 21373.000000 21373.000000
mean 496.319328 508.482104 522.844898 524.581996
std 5134.819373 5298.939807 5496.697513 5545.939303
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 7.000000 7.000000 7.000000 7.000000
75% 78.000000 80.000000 82.000000 83.000000
max 388100.000000 402975.000000 425537.000000 434724.000000
2010 2011 2012 2013
count 21373.000000 21373.000000 21477.000000 21477.000000
mean 535.492069 553.399242 560.569214 575.557480
std 5721.089425 5883.071604 6047.950804 6218.379479
min 0.000000 0.000000 -169.000000 -246.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 7.000000 8.000000 8.000000 8.000000
75% 83.000000 86.000000 88.000000 90.000000
max 451838.000000 462696.000000 479028.000000 489299.000000
[8 rows x 55 columns]
RangeIndex: 21477 entries, 0 to 21476
Data columns (total 59 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Code 21477 non-null object
1 Country 21477 non-null object
2 Food 21477 non-null object
3 Type 21477 non-null object
4 Latitude 21477 non-null float64
5 Longitude 21477 non-null float64
6 1961 17938 non-null float64
7 1962 17938 non-null float64
8 1963 17938 non-null float64
9 1964 17938 non-null float64
10 1965 17938 non-null float64
11 1966 17938 non-null float64
12 1967 17938 non-null float64
13 1968 17938 non-null float64
14 1969 17938 non-null float64
15 1970 17938 non-null float64
16 1971 17938 non-null float64
17 1972 17938 non-null float64
18 1973 17938 non-null float64
19 1974 17938 non-null float64
20 1975 17938 non-null float64
21 1976 17938 non-null float64
22 1977 17938 non-null float64
23 1978 17938 non-null float64
24 1979 17938 non-null float64
25 1980 17938 non-null float64
26 1981 17938 non-null float64
27 1982 17938 non-null float64
28 1983 17938 non-null float64
29 1984 17938 non-null float64
30 1985 17938 non-null float64
31 1986 17938 non-null float64
32 1987 17938 non-null float64
33 1988 17938 non-null float64
34 1989 17938 non-null float64
35 1990 18062 non-null float64
36 1991 18062 non-null float64
37 1992 20490 non-null float64
38 1993 20865 non-null float64
39 1994 20865 non-null float64
40 1995 20865 non-null float64
41 1996 20865 non-null float64
42 1997 20865 non-null float64
43 1998 20865 non-null float64
44 1999 20865 non-null float64
45 2000 21128 non-null float64
46 2001 21128 non-null float64
47 2002 21128 non-null float64
48 2003 21128 non-null float64
49 2004 21128 non-null float64
50 2005 21128 non-null float64
51 2006 21373 non-null float64
52 2007 21373 non-null float64
53 2008 21373 non-null float64
54 2009 21373 non-null float64
55 2010 21373 non-null float64
56 2011 21373 non-null float64
57 2012 21477 non-null int64
58 2013 21477 non-null int64
dtypes: float64(53), int64(2), object(4)
memory usage: 9.7+ MB
The data here all seems to make sense. The mean and max food production numbers are increasing over time as would be expected.
As noted previously, there are a number of null entries, mostly in earlier years, because some countries were not formed until later. It makes sense to fill these null values with 0 for the time being (until Q7), as we can treat this as no food production. By 2013, there are no longer any null values.
# Fill the missing values with 0
food_df.fillna(value=0, inplace=True)
# The data includes both food and feed. Mostly we will look just at food until Q7.
# Filter dataframe to exclude Feed
df = food_df[food_df['Type'] == 'Food']
df = df.drop(['Type'],axis = 1)
df.head()
Let's also have a quick look at how many different types of food are included in the data set. By using value_counts(), we can see that there are 114 different types of food. That's quite a lot!
df['Food'].value_counts()
Milk - Excluding Butter     348
Vegetables                  348
Eggs                        348
Cereals - Excluding Beer    174
Meat, Other                 174
                           ...
Ricebran Oil                 18
Palm kernels                 17
Sugar beet                   10
Meat, Aquatic Mammals         3
Cottonseed                    1
Name: Food, Length: 114, dtype: int64
File 2- Population Data
# Inspect the data
print(population_df.columns.is_unique)
print(population_df.shape)
print(population_df[population_df.duplicated() == True])
print(population_df.describe())
population_df.info()
True
(217, 60)
Empty DataFrame
Columns: [Country, Indicator Name, Indicator Code, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
Index: []
[0 rows x 60 columns]
1960 1961 1962 1963 1964 \
count 2.140000e+02 2.140000e+02 2.140000e+02 2.140000e+02 2.140000e+02
mean 3.016968e+07 3.056597e+07 3.107441e+07 3.169516e+07 3.232110e+07
std 2.159191e+08 2.186293e+08 2.222889e+08 2.268859e+08 2.315160e+08
min 4.279000e+03 4.453000e+03 4.566000e+03 4.656000e+03 4.748000e+03
25% 2.370350e+05 2.384952e+05 2.416770e+05 2.472330e+05 2.533580e+05
50% 2.485136e+06 2.518778e+06 2.553392e+06 2.588994e+06 2.625678e+06
75% 8.154541e+06 8.388898e+06 8.583787e+06 8.801223e+06 8.907580e+06
max 3.034193e+09 3.075115e+09 3.127961e+09 3.192794e+09 3.258201e+09
1965 1966 1967 1968 1969 \
count 2.140000e+02 2.140000e+02 2.140000e+02 2.140000e+02 2.140000e+02
mean 3.295887e+07 3.362501e+07 3.428736e+07 3.495626e+07 3.566383e+07
std 2.362462e+08 2.412288e+08 2.461768e+08 2.511904e+08 2.565062e+08
min 4.841000e+03 4.936000e+03 5.033000e+03 5.161000e+03 5.303000e+03
25% 2.601970e+05 2.679000e+05 2.763265e+05 2.849745e+05 2.926555e+05
50% 2.663446e+06 2.702518e+06 2.747546e+06 2.810650e+06 2.885900e+06
75% 8.941470e+06 9.018596e+06 9.252574e+06 9.529238e+06 9.630918e+06
max 3.324952e+09 3.394865e+09 3.464440e+09 3.534821e+09 3.609384e+09
... 2007 2008 2009 2010 \
count ... 2.170000e+02 2.170000e+02 2.170000e+02 2.170000e+02
mean ... 6.370892e+07 6.448298e+07 6.525662e+07 6.602829e+07
std ... 4.690950e+08 4.748336e+08 4.805751e+08 4.863082e+08
min ... 1.007500e+04 1.004700e+04 1.002800e+04 1.002500e+04
25% ... 6.869580e+05 7.009500e+05 7.144580e+05 7.276410e+05
50% ... 5.949787e+06 6.047117e+06 6.121053e+06 6.164626e+06
75% ... 2.082760e+07 2.053788e+07 2.082489e+07 2.115164e+07
max ... 6.680423e+09 6.763746e+09 6.847215e+09 6.930657e+09
2011 2012 2013 2014 2015 \
count 2.170000e+02 2.170000e+02 2.170000e+02 2.170000e+02 2.170000e+02
mean 6.678155e+07 6.756343e+07 6.835548e+07 6.915420e+07 6.995370e+07
std 4.919485e+08 4.977488e+08 5.036086e+08 5.095140e+08 5.154269e+08
min 1.005700e+04 1.027900e+04 1.081900e+04 1.090800e+04 1.100100e+04
25% 7.405100e+05 7.529670e+05 7.580810e+05 7.633930e+05 7.774240e+05
50% 6.192560e+06 6.198258e+06 6.195970e+06 6.204108e+06 6.234955e+06
75% 2.174395e+07 2.234657e+07 2.296115e+07 2.346069e+07 2.378934e+07
max 7.012844e+09 7.097401e+09 7.182860e+09 7.268986e+09 7.355220e+09
2016
count 2.170000e+02
mean 7.076072e+07
std 5.213908e+08
min 1.109700e+04
25% 7.956010e+05
50% 6.293253e+06
75% 2.412716e+07
max 7.442136e+09
[8 rows x 57 columns]
Index: 217 entries, ABW to ZWE
Data columns (total 60 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 217 non-null object
1 Indicator Name 217 non-null object
2 Indicator Code 217 non-null object
3 1960 214 non-null float64
4 1961 214 non-null float64
5 1962 214 non-null float64
6 1963 214 non-null float64
7 1964 214 non-null float64
8 1965 214 non-null float64
9 1966 214 non-null float64
10 1967 214 non-null float64
11 1968 214 non-null float64
12 1969 214 non-null float64
13 1970 214 non-null float64
14 1971 214 non-null float64
15 1972 214 non-null float64
16 1973 214 non-null float64
17 1974 214 non-null float64
18 1975 214 non-null float64
19 1976 214 non-null float64
20 1977 214 non-null float64
21 1978 214 non-null float64
22 1979 214 non-null float64
23 1980 214 non-null float64
24 1981 214 non-null float64
25 1982 214 non-null float64
26 1983 214 non-null float64
27 1984 214 non-null float64
28 1985 214 non-null float64
29 1986 214 non-null float64
30 1987 214 non-null float64
31 1988 214 non-null float64
32 1989 214 non-null float64
33 1990 216 non-null float64
34 1991 216 non-null float64
35 1992 215 non-null float64
36 1993 215 non-null float64
37 1994 215 non-null float64
38 1995 216 non-null float64
39 1996 216 non-null float64
40 1997 216 non-null float64
41 1998 217 non-null int64
42 1999 217 non-null int64
43 2000 217 non-null int64
44 2001 217 non-null int64
45 2002 217 non-null float64
46 2003 217 non-null float64
47 2004 217 non-null float64
48 2005 217 non-null float64
49 2006 217 non-null float64
50 2007 217 non-null float64
51 2008 217 non-null float64
52 2009 217 non-null float64
53 2010 217 non-null float64
54 2011 217 non-null float64
55 2012 217 non-null float64
56 2013 217 non-null float64
57 2014 217 non-null float64
58 2015 217 non-null float64
59 2016 217 non-null int64
dtypes: float64(52), int64(5), object(3)
memory usage: 103.4+ KB
The data seems to all make sense, with min, mean and max values all increasing over time as we would expect with population.
Again, three countries have missing values prior to 1998. In this case, it doesn't make sense to fill these with 0, as we don't want them to be confused with a population of 0. As only three rows are affected, let's just remove them. We also need to remove the years that are not present in the food data file (1960 and 2014-16).
Here we will also add the additional columns needed to answer the questions and remove un-needed columns. Some of the data here looks like it needs rounding, however for accuracy, full decimal places will be kept here, with results rounded later.
# Remove columns that aren't needed
population_df = population_df.drop(
['Indicator Name', 'Indicator Code', '1960', '2014', '2015', '2016','Country'],
axis = 1
)
# Remove the missing data lines
population_df.dropna(axis=0, how='any', inplace=True)
# Stack the data to allow merging with the food data
population_ser = population_df.stack()
population_ser.name = "Population"
population_df = pd.DataFrame(population_ser)
population_df.index.names = ['Country Code', 'Year']
# Create a merged data frame including population so we can look at production per capita
country_food_df = df.groupby(['Country Code', 'Country']).sum()
country_food_df = country_food_df.drop(['Longitude','Latitude'],axis = 1)
country_food_ser = country_food_df.stack()
country_food_ser.name = "Total Food Production"
country_food_df = pd.DataFrame(country_food_ser)
country_food_df.index.names = ['Country Code', 'Country', 'Year']
full_df = country_food_df.merge(population_df, on=['Country Code', 'Year'], how='inner')
# Add a new column that shows Total Food Production per 10,000
full_df['Per 10,000'] = full_df['Total Food Production'] / (full_df['Population'] / 10000)
full_df = full_df.reset_index()
# Add country name back to the full_df
country_name_df = df.iloc[:,0:2]
full_df = full_df.merge(country_name_df, on=['Country Code'], how='left')
full_df = full_df[full_df.duplicated() == False]
full_df = full_df.reset_index(drop=True)
full_df.head()
File 3- Obesity Data
# Inspect the data
print(obesity_df.columns.is_unique)
print(obesity_df.shape)
print(obesity_df[obesity_df.duplicated() == True])
print(obesity_df.describe())
obesity_df.info()
obesity_df
True
(24570, 4)
Empty DataFrame
Columns: [Country, Year, Obesity (%), Sex]
Index: []
Year
count 24570.000000
mean 1995.500000
std 12.121165
min 1975.000000
25% 1985.000000
50% 1995.500000
75% 2006.000000
max 2016.000000
Int64Index: 24570 entries, 0 to 24569
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 24570 non-null object
1 Year 24570 non-null int64
2 Obesity (%) 24570 non-null object
3 Sex 24570 non-null object
dtypes: int64(1), object(3)
memory usage: 959.8+ KB
We can see from the data description, that the obesity data only starts from 1975. This will be ok for our purposes, but will mean that the merged dataframe will contain some Null values under Obesity (%) as can be seen below.
Additionally, it was noted that one percentage was included as 'No data' which will be removed from the data. There is also some general formatting needed e.g. removing square brackets.
# Filter to include only 'Both Sexes' and remove un-needed columns
obesity_df = obesity_df[obesity_df['Sex'] == 'Both sexes']
obesity_df = obesity_df.drop(['Sex'], axis = 1)
obesity_df = obesity_df[obesity_df['Obesity (%)'] != 'No data']
# Remove square brackets from the data
def cleaner(x):
    x = x.split('[')
    return x[0]
obesity_df['Obesity (%)'] = obesity_df['Obesity (%)'].apply(cleaner)
obesity_df['Year'] = obesity_df['Year'].astype(str)
# Merge with the full_df from above
# Reminder, obesity data will have many Null values at this point
population_obesity_df = full_df.merge(obesity_df, on=['Country','Year'], how='left')
population_obesity_df
4. Answers to the Questions
Q1. In 2013, what was the most produced type of food in the world by weight?
Let's look at the top 20 food product types across the world by total weight. A bar chart will be used rather than a pie graph because there are too many different types of food to show a meaningful result on a pie graph.
world_foods_df = df.groupby(['Food']).sum()
world_foods_df = world_foods_df.sort_values(
    by='2013',
    ascending=False,
)
# Consider just the top 20 food items
world_foods_df = world_foods_df.iloc[0:20, :]
# Sort the results and show them on a bar graph
world_foods_df = world_foods_df.sort_values(
    by='2013',
    ascending=True,
)
# Change total production units to billion tonnes
world_foods_df['2013'] = world_foods_df['2013'].apply(lambda x: x / 1000000)
world_foods_df['2013'].plot(
    kind='barh',
    figsize=[12, 8],
    title='The World\'s Most Produced Types of Food by Weight (Billion Tonnes, 2013)',
    color='orange'
)
Answer 1
In 2013, the most produced type of food by weight in the world was vegetables, followed by milk.
Q2. In 2013, what was the most produced type of food in Australia and for Australia's top 20 foods, how do the proportions of these compare with the world?
# Filter to the Australian data for 2013
australia_df = df[df['Country Code'] == 'AUS']
australia_df = australia_df.groupby(['Food']).sum()
australia_ser = australia_df['2013']
# Sort the top 20 results and show them on a bar graph
australia_ser = australia_ser.nlargest(20)
# Merge with the world data found previously
australia_top_df = pd.DataFrame(australia_ser).merge(world_foods_df['2013'], on=['Food'], how='inner')
australia_top_df = australia_top_df.rename(
    columns={'2013_x': '2013 Australia', '2013_y': '2013 World'}
)
# Find the total food production for Australia and the world & calculate proportions
australia_total = australia_df['2013'].sum()
world_total = world_foods_df['2013'].sum()
australia_top_df['2013 Australia'] = australia_top_df['2013 Australia'].apply(
    lambda x: x / australia_total
)
australia_top_df['2013 World'] = australia_top_df['2013 World'].apply(
    lambda x: x / world_total
)
# Plot the combined data frame on one figure
australia_top_df.plot(
    kind='bar',
    subplots=True,
    figsize=[20, 12],
    sharey=True,
    title=[
        'Australia\'s Most Produced Types by Proportion of Total (2013)',
        'The World\'s Most Produced Types by Proportion of Total (2013)'
    ],
    xlabel='Type of Food',
    fontsize=10
)
Answer 2
In 2013, the most produced type of food by weight in Australia was milk, which represented approximately 25% of Australia's food production.
Foods such as Beer, Alcohol and Meat are produced at relatively high proportions in Australia compared to the world overall. Conversely, Vegetables, Fruits and Cereals are produced at lower proportions in Australia.
Q3. In 2013, which countries produced the most amount of food per 10,000 people?
To answer this question we now need to look at the population_obesity_df including the population data and filter to include only 2013.
To best represent the results, they will be plotted on a world geographical heat map, allowing us to see trends across different regions. The Geopandas module will be used for this.
per_population_df = population_obesity_df
per_population_df = per_population_df[per_population_df['Year'] == '2013']
# Load the world map data file
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world = world[(world.name != "Antarctica")]
world.rename({'iso_a3': 'Country Code'}, axis=1, inplace=True)
# Fix some incorrect country codes that were discovered from missing data
world['Country Code'] = where(world['name'] == 'France', 'FRA', world['Country Code'])
world['Country Code'] = where(world['name'] == 'Norway', 'NOR', world['Country Code'])
# Merge with our data
world = world.merge(per_population_df, on=['Country Code'], how='left')
# Create the heatmap
fig, ax_world = plt.subplots(1, figsize=(18, 7))
ax_world.set_title(
    'World Food Production per 10,000 People (\'000 Tonnes, 2013)',
    fontdict={'fontsize': '20', 'fontweight': '50'}
)
ax_world.axis('off')
world.plot(
    column='Per 10,000',
    legend=True,
    ax=ax_world,
    cmap='viridis',
    missing_kwds={"color": "lightgrey"}
)
# Also show the top 15 countries in a table
per_population_df = per_population_df.sort_values(
    by='Per 10,000',
    ascending=False,
)
per_population_top20_df = per_population_df.iloc[0:15, :].reset_index(drop=True)
per_population_top20_df['Per 10,000'] = per_population_top20_df['Per 10,000'].apply(
    lambda x: round(x, 2)
)
per_population_top20_df = per_population_top20_df.drop(
    ['Year', 'Total Food Production', 'Population', 'Obesity (%)'], axis=1
)
per_population_top20_df
Answer 3
In 2013, Montenegro was the country that produced the most food per 10,000 people at 24.2 thousand tonnes. Looking at the geographical heatmap, we can see that countries in Africa, South America and South Asia generally produce relatively less food per person compared with North America, Oceania and Europe.
Q4. Continuing with 2013, how does the top food product produced in each country compare to how far it is located from the equator?
# Create pivot table with the required data
df_largest_product = df.pivot_table(
    index=['Country', 'Country Code', 'Latitude', 'Longitude'],
    columns='Food',
    values='2013',
    aggfunc='sum',
)
# Find the most produced product by weight for each country
df_largest_product = df_largest_product.idxmax(axis=1)
df_largest_product = df_largest_product.reset_index()
df_largest_product.rename({0: 'Top Product'}, axis=1, inplace=True)
# Change Latitude into Degrees from Equator
df_largest_product['Latitude'] = df_largest_product['Latitude'].apply(lambda x: abs(x))
df_largest_product = df_largest_product.rename(columns={'Latitude': 'Degrees from Equator'})
# Let's add the total food production per 10,000 people
population_2013_df = per_population_df[per_population_df['Year'] == '2013']
population_2013_df = population_2013_df.drop(
    ['Year', 'Total Food Production', 'Population', 'Country', 'Obesity (%)'],
    axis=1
)
df_largest_product = df_largest_product.merge(population_2013_df, on='Country Code', how='left')
# Let's exclude countries with very small production as they skew the graphical representation
df_largest_product = df_largest_product[df_largest_product['Per 10,000'] > 5]
df_largest_product['Per 10,000'] = df_largest_product['Per 10,000'].apply(lambda x: round(x, 2))
df_largest_product
# Plot the results on a scatter plot
sns.set_style("white")
g = sns.relplot(
    data=df_largest_product,
    kind='scatter',
    x='Per 10,000',
    y='Degrees from Equator',
    height=8.27,
    s=150,
    hue='Top Product',
)
g.set_axis_labels(
    "Total Food Production per 10,000 People ('000 Tonnes)", "Degrees from Equator",
    fontsize=15
)
plt.title("Top Food Product per Country is Restricted by Distance from Equator", fontsize=20)
Answer 4
The results here are interesting as it seems that there is a correlation between distance from the equator, food production per 10,000 people and the type of top product produced in each country.
Firstly, we can see that there is a relationship between the top product produced in a country and how far it is located from the equator.
While milk seems to be well distributed, starchy roots as a top product is restricted to less than 20 degrees from the equator, cereal & fruits less than 40 degrees and vegetables less than 45 degrees.
The other trend we can see here is a relationship between the top food type and total production per 10,000 people. This could be due to the weight of the products themselves; milk, being mostly water, is much heavier per unit of nutrition than cereals, which is consistent with the pattern seen above.
Q5. Is there a relationship between food production per 10,000 people and a country's obesity %?
# Set up the data structure & filter
population_obesity_df = population_obesity_df.dropna(axis=0, how='any')
population_obesity_df = population_obesity_df[population_obesity_df['Year'] == '2013']
population_obesity_df['Obesity (%)'] = population_obesity_df['Obesity (%)'].astype(float)
# Again, let's exclude some outliers in the data
population_obesity_df = population_obesity_df[population_obesity_df['Per 10,000'] > 5]
population_obesity_df = population_obesity_df[population_obesity_df['Obesity (%)'] < 40]
# Plot the data
population_obesity_df.plot(
    kind='scatter',
    x='Per 10,000',
    y='Obesity (%)',
    figsize=[12, 8],
    color='blue',
    title='Relationship Between Obesity (%) and Food Production per 10,000 (\'000 Tonnes, 2013)'
)
Answer 5
While there does appear to be a weak correlation between food production per capita and obesity, it is difficult to say whether this is a direct relationship. It does seem that countries that produce more than 18 thousand tonnes per 10,000 people all have obesity levels > 15%.
In the previous question, we found that countries which produced more per capita tend to be weighted heavily towards milk production. This could also mean that countries that produce more milk, tend to have higher obesity percentages.
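As an optional extra check that is not part of the original notebook, the strength of this relationship could be quantified with a correlation coefficient on the filtered dataframe used for the scatter plot above:
# Optional sketch: quantify the relationship seen in the scatter plot
corr = population_obesity_df['Per 10,000'].corr(population_obesity_df['Obesity (%)'])
print(f"Pearson correlation between production per 10,000 people and obesity (%): {corr:.2f}")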
Now let's move our attention to how things have changed over time.
Q6. Which countries produced the most total food in 2013, and how has their production per 10,000 people changed over time?
# First find the 2013 top 10 production countries
recent_df = full_df[full_df['Year'] == '2013']
recent_df = recent_df.sort_values(
    by='Total Food Production',
    ascending=False,
)
recent_df['Per 10,000'] = recent_df['Per 10,000'].apply(lambda x: round(x, 2))
# Filter and display the results
recent_top10_df = recent_df.iloc[0:10, :]
recent_top10_df
# Filter data to top 10 countries & create required layout
# Let's add Australia in as well for interest
top_10 = ['AUS', 'IND', 'USA', 'BRA', 'IDN', 'NGA', 'PAK', 'MEX', 'DEU', 'CHN']
full_time_df = full_df[full_df['Country Code'].isin(top_10)]
full_time_df = full_time_df.pivot_table(
    index='Year',
    columns='Country',
    values='Per 10,000',
    aggfunc='sum',
)
# Create a line plot
full_time_df.plot(
    kind='line',
    title='Change in Production (per 10,000) - Top 10 Food Producing Countries (2013)',
    ylabel='Production per 10,000 People (\'000 Tonnes)',
    figsize=(15, 10)
)
Answer 6
The countries which produced the most food in total in 2013 were China, India, the USA and Brazil.
Over the last 50 years, there has been relatively little change in per capita food production for Australia. However, the top 10 countries have all increased per capita production over this time. In particular, China's production per capita has increased roughly threefold and Brazil's by around 50%.
Q7. What proportion of the world's food by weight is produced for animal feed vs. human food and how has total production changed over time?
To answer this one, we will use the data from before the 'Feed' rows were filtered out, i.e. the original food_df dataframe.
# Create a pivot table to find the total food production for each type
counts_df = food_df.pivot_table(
    index='Type',
    values='2013',
    aggfunc='sum'
)
# Plot as a pie graph
counts_df['2013'].plot(
    kind='pie',
    title='Proportion of the World\'s Food Production for Human/Animal Consumption',
    autopct="%.1f%%",
    figsize=(10, 7)
)