
DATA SCIENCE PROJECT – Coronavirus Tracker Using Python Programming and Tableau

A few days back I uploaded a story on my Instagram where I did an AMA on a data analysis project that I was working on. As a result, many of you asked me to do a video on that project, so here I am. Though the dataset will be different, the project will be pretty much the same. What's up, Internet! I am Manish from RebellionRider.com. Today we are going to do data analysis of COVID-19 data using Python and Tableau.

Here is how I am planning to proceed with this project.

1. Data Collection: First, I will try to find a dataset. Anything which is continuously updated with the latest information will work. A huge amount of the workflow of this project depends on the nature of this dataset.

2. Data Cleaning (if needed): Second, I will look for anomalies in the dataset and try to fix them. I will aim to find a dataset with few errors and redundancies, but if cleaning is still needed, we will do it.

3. Data Storage: In the end we will store this dataset in such a way that we can use it with our data analysis software, which is Tableau.

Let's start with gathering the data. While I was searching for data for this project, I came across various good APIs, but unfortunately all of them were paid. And all the free APIs were either incomplete, poorly organized, or not updated regularly. That leaves me with only one option, which is web scraping. Initially my plan was to get the data from the worldometer.info website. I had even drafted half of this tutorial around it, but just as I started recording this video I got to know that the data on that website is also not free. Out of respect, I decided to drop the already drafted script and code and look for something which is truly free and available for any of us to use. So far, one thing which is very clear to me is that finding a good free dataset is next to impossible. Therefore, right now, my aim is to find a dataset that is updated regularly, regardless of whether it is clean or consistent. The only website which fits the bill is Wikipedia: free and updated, but not the cleanest data to work with. I admit it's not my first choice, as the data is not very clean, which means we have to do a lot of work on it. And that also means you are going to learn various new things. So, let's do that. But before anything else, you can download all the files of this project from my GitHub page. All the necessary links are in the description.

Go ahead and check that out. That being said, let's get back to the tutorial. First, let's quickly create a directory for storing all our files and name it covid-19 Analysis. Now let's launch the IDE. For this project my choice of IDE is going to be Spyder, as you can see here, but you can use whatever editor or IDE you want. Now let's create a file and save it into the directory that we just created. I will name this file main.py. Done. Next, click on this menu button and select "Set console working directory". Done! Now we are all set to write our code.

For web scraping, that is, fetching the data from the website, we will be using the pandas and requests libraries of Python. Using the requests library we will fetch the data from the URL, and using pandas we will save that data into a data frame and process it. A data frame is a heterogeneous two-dimensional tabular data structure, which means it can hold data of different data types in row and column format. This makes it the best-fitting structure for holding this type of data. Let's do that.

import pandas as pd
import requests

Here I just imported both the libraries that we need.

url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic'
req = requests.get(url)

As you can see, I have declared two variables, url and req. In the first variable I am saving the URL of the website from which we want to fetch the data. In the next line we call the get method of the requests library and pass the URL as its parameter. This call fetches all the data from the specified URL and returns a response object, which is saved into the variable req. Let's execute this code. So, we have fetched the data. But this is not yet the data that we want; rather, it's the entire text of the website along with all the HTML, CSS formatting, and other metadata. If you double-click on the variable name here, you can see what it is holding right now. And if you double-click on its text attribute,

you can see all the web page data that it is holding. Now the question is, how are we going to filter out the data that we need from all of this? To do that we will need the corresponding path of this text attribute, which is req.text. Just note it down and close this window. Done! To filter out the desired data we will be using the read_html function of the pandas library. This function looks for all the data which is stored in table format on the specified web page and extracts it into a list of data frames. Let's do that.

data_list = pd.read_html(req.text)

Here we called the read_html function and passed it req.text, which I mentioned earlier. On execution it fetches all the desired data and saves it into the variable data_list. Let's execute this code. In the variable explorer you can see that the data type of this variable is list. This is not just any list but a list of data frames. If you double-click on it you will see that it contains 22 different data frames. From these 22 we need to find the one which has our desired data, and by desired data I mean the country-specific COVID-19 data. If you look at them carefully, you will notice that data frame number 4 is the largest in size. Let's double-click on it and see what we have inside. So, this data frame holds the data that we need, though not in its cleanest form. We will handle that. But for now, let's extract this data frame out of the list.

target_df = data_list[4]

Here I created a variable with the name target_df. On the right-hand side of this assignment statement I have written the name of the variable which is holding all our data frames, followed by the index number of our target data frame, which is 4, inside square brackets. Let's execute this code. The collection step so far is also summarized in the sketch below.
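For reference, here is the collection step consolidated into one small, commented sketch. Everything in it comes from the steps above except two notes of my own: the io.StringIO wrapper, which newer pandas versions may ask for when you pass raw HTML text, and the reminder that table index 4 is only valid as long as Wikipedia keeps the same page layout.

import io
import pandas as pd
import requests

# Fetch the raw HTML of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic'
req = requests.get(url)

# Parse every HTML table on the page into a list of data frames.
# Recent pandas versions prefer a file-like object over a raw string,
# hence the io.StringIO wrapper.
data_list = pd.read_html(io.StringIO(req.text))

# Table number 4 held the country-wise figures at the time of recording;
# if Wikipedia reorders its tables, inspect data_list and adjust the index.
target_df = data_list[4]

If you would rather not depend on the table position at all, read_html also accepts a match argument that keeps only the tables whose text matches a given pattern, which can make this selection a little more robust.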

Execution done! Let's come to the variable explorer and see what we have in the variable target_df. Great! We successfully extracted our desired dataset, and with this we are done with the data collection. Now let's move on to the next step, which is data cleaning. This is going to be the most challenging part, but full of knowledge, and it's going to be fun. Let's do it. But before that, I just want to ask you to subscribe to the channel if you haven't already and press the bell icon to get notified every time I upload a new video. And most importantly, don't forget to press the thumbs up button if you like this video. Thanks. Now let's move on to the next step.

Data Cleaning

Let's see what we need to clean here. First, we have two completely irrelevant columns: the first one is full of NaN values and the last one is full of reference numbers which are of no use to us. Also, this dataset has the worst column names; we'll have to fix those too. Column number 4 has the string 'No data' stored in some rows; we will replace this with 0. I can also see that some of the country names end with letters enclosed in square brackets. We will need to fix this because it can be a huge issue during data analysis. Let's scroll down and check whether we have any issues there as well. OK, the last two rows have text stored in them, which again is of no use to us. It's a small issue; we will delete these rows. These are the issues which are visible to us. The one which is not visible is the data type of the values stored in this dataset. The data in these columns looks like integers but is actually stored as strings, which we need to convert. So, in total we have six issues. Let's fix them one by one.

Issue #1: Column Names

Let's start by changing the column names into something more meaningful. We will give meaningful names only to columns 1, 2, 3 and 4, not to columns 0 and 5, as those are of no use and we will be getting rid of them. Here are the column names that I am planning to assign. Anyway, let's write the code. This is going to be simple.

target_df.columns = ['Col0', 'Country Name', 'Total Cases', 'Total Deaths', 'Total Recoveries', 'Col5']

This is a pretty simple assignment statement. On the left-hand side we have the name of the variable holding the data frame we want to update, followed by a dot (.) and then the keyword columns. On the right-hand side we have the new names that we want to give to our columns, each enclosed in single quotes, separated by commas, with the entire list enclosed in square brackets. Let's execute the code. Done. Let's check. The column names have been updated. Great, issue number 1 solved. Before we move on to the next one, here is this step once more with a quick sanity check.
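Only the shape check is my addition here; the renaming itself is exactly the statement we just ran. The point of the check is that the list of new names must contain exactly as many entries as the table has columns (six at the time of recording), otherwise pandas raises a ValueError.

# Confirm the scraped layout before renaming: six columns were expected
# at the time of recording, and the list below must match that count.
print(target_df.shape)

target_df.columns = ['Col0', 'Country Name', 'Total Cases',
                     'Total Deaths', 'Total Recoveries', 'Col5']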

Issue #2: Extra Columns

In our dataset we have two columns, column 0 and column 5, which are of no use. We will deal with them by extracting only the columns that we need and saving those into the variable. That way we leave the extra columns out of our target data frame. Like this:

target_df = target_df[['Country Name', 'Total Cases', 'Total Deaths', 'Total Recoveries']]

On execution, this statement extracts all the columns specified in the list and saves them into the variable, which in our case is target_df. Make sure to enclose the list inside a pair of double square brackets, otherwise you will get an error. Let's execute and see! Done. Let's quickly check our target_df variable and see whether the columns have been updated. Great, now this variable has data only from the desired columns. With this, issue number 2 is also resolved. Before moving on to the next one, which is extra rows of data, let me add some comments to this script so that it will be more readable. Now the script is looking much better and more user friendly.

Issue #3: Extra Rows

As we saw earlier, there are two rows at the bottom of the dataset which are irrelevant for our data analysis project, so we will remove them. To remove these rows we need their corresponding index numbers, so let's get those. Let me first open up the target data frame. OK, the index numbers are 229 for the last row and 228 for the second-to-last row. Let's close this window and write some code.

target_df = target_df.drop([229, 228])

On execution, this drop function removes the rows corresponding to index numbers 229 and 228. But we should not hard-code these index numbers. Why? Let's say tomorrow Wikipedia updates this list by adding two more countries to the dataset. In that case the index numbers of the last and second-to-last rows will change, and if that happens this script is of no use. The solution is not to hard-code the numbers into the statement but to fetch the index number dynamically. Like this:

last_idx = target_df.index[-1]

On execution, this statement returns the index number of the last row of our data frame and saves it into the variable last_idx. Now we use this variable to delete those rows. Like this:

last_idx = target_df.index[-1]
target_df = target_df.drop([last_idx, last_idx - 1])

The list passed to the drop function holds two labels: last_idx, the index of the last row, and last_idx - 1, the index of the second-to-last row. Now, no matter how many extra rows of data get added to the dataset in the future, this script will always work, assuming the last two rows always hold the irrelevant text. Let's execute and see if it works or not! Done! Let's quickly check. Great, the last two rows are gone! Issue number 3 solved. Both of these fixes are summarized in the sketch below.
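Here are both fixes together as a commented sketch. The iloc shortcut in the last comment is my suggestion rather than part of the walkthrough; it does the same job as the dynamic drop.

# Issue #2: keep only the four columns we need. The inner brackets form a
# list of column names; the outer brackets select them from the data frame.
target_df = target_df[['Country Name', 'Total Cases',
                       'Total Deaths', 'Total Recoveries']]

# Issue #3: drop the last two rows without hard-coding their index numbers,
# so the script keeps working when Wikipedia adds more countries.
last_idx = target_df.index[-1]
target_df = target_df.drop([last_idx, last_idx - 1])
# equivalent shortcut: target_df = target_df.iloc[:-2]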

Issue #4: Inconsistent Country Names

The solution to this issue is going to be an interesting one, because to solve it we will be using RegEx, or regular expressions. We need a regular expression which searches all the country names and matches the ones which have a string written inside square brackets, for example United Kingdom [UK]. The regular expression which best fits such strings is \[.*\]. It matches anything written inside square brackets, regardless of its position, so it doesn't matter whether the bracketed part comes at the end, in the middle, or even at the start of the country name. The reason for doing it this way is to future-proof this Python data processing script. Now, how are we going to use it? To modify the country names we will use the string function replace of the pandas library. Let's do that.

target_df['Country Name'] = target_df['Country Name'].str.replace('\[.*\]', '')

To use the string function replace, we first write the series that we want to update; in our case that series is the country names stored in the 'Country Name' column of the target_df data frame. Then we write the keyword str, and after that comes the function call, each part separated by the dot (.) operator. The function takes two parameters here: the first is the regular expression and the second is an empty string, just a pair of single quotes with nothing between them, not even a space, because we want to replace the bracketed part of the country name with nothing. Let's execute the program and see the result. A commented version of this step, with one version-related note, follows below.
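Here is the same replacement with two small changes of my own: the pattern is written as a raw string, and regex=True is passed explicitly, because from pandas 2.0 onwards str.replace treats the pattern as a plain string by default.

# Strip bracketed annotations such as "[a]" or "[UK]" from the country names.
# regex=True is needed on pandas 2.0+ for the pattern to be treated as a RegEx.
target_df['Country Name'] = target_df['Country Name'].str.replace(
    r'\[.*\]', '', regex=True)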

Execution done! Here we have the country names in the desired format. Just to show you the difference, here are the before and after of our data frame. I must admit this dataset has started looking much cleaner. Anyway, issue number 4 is also solved. Now let's move on to the next one.

Issue #5: Extra Value in Column 4

Here is the data frame, and as we can see, some rows have the string "No data" stored in the 'Total Recoveries' column. To fix this issue we will again use the replace function and replace the string "No data" with 0. Note that both parameters of this string replace function need to be of string data type, so for now we replace it with the string '0'; we will fix the data types right after. Meanwhile, let's write the function call.

target_df['Total Recoveries'] = target_df['Total Recoveries'].str.replace('No data', '0')

On execution, this call replaces the string "No data" with another string, which is '0'. Let's execute and check the result. Execution done! Let's check the data frame. So, we have successfully replaced "No data" with 0. Now let's fix the data type of all these values.

Issue #6: Wrong Data Types

We know that the data stored in the columns Total Cases, Total Deaths and Total Recoveries is of string data type, whereas it should be numeric. So, let's fix it. To change the data type of our data we will use a function named to_numeric from the pandas library. Like this:

target_df['Total Cases'] = pd.to_numeric(target_df['Total Cases'])

Here pd is the alias that we assigned to the pandas library while importing it, and after that comes the function call.

As the parameter to this function we pass the column whose data type we want to change. Similarly, let's write the statements for the remaining two columns.

target_df['Total Deaths'] = pd.to_numeric(target_df['Total Deaths'])
target_df['Total Recoveries'] = pd.to_numeric(target_df['Total Recoveries'])

Done! Let's execute. Great! Let's check the dataset. Nice and colorful. We have done it. Here is the raw dataset that we fetched from Wikipedia, and here is the one we just cleaned. The resulting dataset is already looking much better. With this we are done with the first two steps of data processing, data collection and data cleaning. The last two fixes are recapped in the sketch below.
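To recap, here are the last two fixes in one place. The only thing that was not in the steps above is errors='coerce', a defensive addition of my own which turns any value that still cannot be parsed into NaN instead of stopping the script with an error.

# Issue #5: replace the 'No data' placeholder with a numeric-looking string
target_df['Total Recoveries'] = target_df['Total Recoveries'].str.replace(
    'No data', '0')

# Issue #6: convert the three figure columns from strings to numbers
for col in ['Total Cases', 'Total Deaths', 'Total Recoveries']:
    target_df[col] = pd.to_numeric(target_df[col], errors='coerce')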

Now let's perform the last step, which is organizing and exporting the data.

Export the Data

We are now in the last phase of our data processing. The only thing left is to save the data in a format that our data analysis software can use. For our data analysis we are using Tableau, and the two most widely accepted file formats in Tableau are CSV and XLSX, so you can export your dataset in either of these. I will export my data in XLSX format, but for the sake of learning I will show you how to save the dataset in both formats. Exporting the dataset in either format is very easy. To export the dataset to CSV we simply call the to_csv() function. Like this:

target_df.to_csv(r'covid19_dataset.csv')

And to export the dataset in the XLSX format of Microsoft Excel we just call the to_excel() function. Like this:

target_df.to_excel(r'covid19_dataset.xlsx')

In both of these function calls, covid19_dataset is the name of the file. Make sure to specify the file name with its extension, just like I did here: in the first call the extension is .csv and in the second it's .xlsx. You will also notice the r before the string. It simply makes the string a raw string in Python, which keeps backslashes from being treated as escape characters if you ever pass a full Windows path here; for a plain file name like this it is optional. Anyway, I will comment out the first function call, as I want to export my data in XLSX format. Done! Let's give this program its final execution. Done! Let's check our project directory and see whether our dataset has been saved into a Microsoft Excel file. Here is our data, nicely saved in the desired format. Let's open up this file and look at the data. Here it is. Great! Everything is done just the way we want. So, we have our dataset saved. Both export calls are shown once more in the sketch below.
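For completeness, here are both export calls side by side. The index=False argument is my addition, not something used above; it keeps the pandas row index from being written out as an extra column. Also note that to_excel() needs an Excel writer engine such as openpyxl installed for .xlsx files.

# CSV export, commented out because only the Excel file is needed here
# target_df.to_csv('covid19_dataset.csv', index=False)

# Excel export; requires an engine such as openpyxl for the .xlsx format
target_df.to_excel('covid19_dataset.xlsx', index=False)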

Now we will put the dataset into Tableau and draw some interesting charts. There are multiple ways of visualizing or analyzing this data. We could plot the charts using libraries like Matplotlib or GeoPandas, or we could use software like QGIS or ArcGIS. But for this tutorial I am using Tableau. First of all, it has a version which you can download for free. Second, it's much easier than the other options I just mentioned: we don't need to write further code or generate sets of files for the analysis. We simply take the XLSX file and process it with the software. So let's do that. But if you want me to do a tutorial using Matplotlib, GeoPandas or QGIS, just go ahead and let me know in the comments.

That being said, let's open up Tableau and draw some interesting charts. First let's minimize this Python script. Now let's open up Tableau. Great, Tableau is open; next we have to load the data source, which is the Excel file that we generated. Come to the left-hand side menu. If you have exported your file in XLSX format, select Microsoft Excel, but if you have exported your file in CSV format, select Text File. In my case I exported my data in XLSX format, therefore I will select the Microsoft Excel option. Next you know what to do: find your file and open it. So, here is my file. Double-click and open it. Done! Here we will be plotting two graphs: the first will be a map and the second will be a Total Deaths vs Total Recoveries chart. Let's start with the map. First click on Sheet 1. Great! Now come to the Dimensions pane and double-click on Country Name. You will have your map plotted on your screen. Before moving ahead: no matter how much you clean your data, something always stays behind. As you can see, we have 17 unknown values in our dataset. These 17 values contain the names of cruise ships and islands which are outside the scope of Tableau's map. If you have the latest version of Tableau, chances are you may not see these errors. Anyway, apart from these values we have data from almost all the countries, which is more than enough for us. Now I want my map to show only the total number of cases, so select Total Cases and drag it onto this button called Label. If you want to format the text, you can also use this button; simply click on it and make your changes, like this. Now it's looking better. But our country names are displayed twice; let's fix that. Go to the Map menu and select Map Layers. First let's change the style and set it to Outdoors. Looking good. Now let's uncheck the Cities and Country/Region Names checkboxes. Now this map is pretty clean. Next let's quickly draw the Total Recoveries vs Total Deaths chart. For that, create another sheet. Again take Country Name from the Dimensions panel and drag it to Columns. From Measures, take Total Recoveries and drag it to Rows. Now sort the data. The chart is looking pretty good. Since it shows the number of recoveries, let's change its color from blue to green.

Good. Next I want to add a filter so that we only see the countries which fall within a range. Let's say I want to see only those countries where the number of recoveries is over 1000. For that, go to Measures again and drag Total Recoveries to Filters. Then select Sum and press Next. Here select Range of Values and specify 1000 in the box. Don't worry, you can change it later. Now press Apply and OK. Next, right-click on the Sum pill and select Show Filter. Using this panel you can now limit the chart to countries within a specific range of values. Also, come to the Show Mark Labels button and press it; now we can see the total number of recoveries on our chart. Next we will repeat all these steps for total deaths. So again go to Measures, select Total Deaths and drag it to Rows. Since this chart shows the total number of deaths, I will change its color to red. We can also add a filter to it: take Total Deaths from Measures, drag it to Filters, and repeat all the steps. Great. Now again right-click on the filter and select Show Filter. Done. Now come here and select Entire View. OK.

Now let's create a dashboard using these two sheets. For that, come here and select the New Dashboard option. Before doing anything else, come to the Size panel and set it to Automatic. Now take a Vertical object, drag it onto the dashboard, and repeat the step one more time. Done. Now take sheet number 1 and drag it into the top vertical, and take sheet number 2 and drag it into the bottom vertical. Right now these are two individual sheets, not connected to each other. But I want that when I select a country in the bar graph it is also highlighted on the map, and vice versa. Let's do that. Go to the Worksheet menu and select Actions. Click Add Action and select Filter. Come to the Source Sheets section. First select the name of your dashboard and then select the sheet which has your bar graph. In Run Action On, choose Select. Now come to Target Sheets, again select your dashboard, and then select the sheet which has your map. Also choose Show All Values, and make sure All Fields is selected in Target Filters. After that, click OK and exit.

Now whatever country you select in your bar graph is also selected on your map, like this. Now quickly create another action which works in the reverse direction, meaning that whatever country you select on the map, its corresponding entry in the bar graph is selected automatically. This time the source sheet you select is the one with your map and the target sheet is the one with your bar graph; apart from that, everything stays the same. Done! Now let's say you want to see the deaths and recoveries for Russia. Simply select Russia and its corresponding entry will appear on your screen, like this. Similarly, you can create more charts and figures, customize your dashboard, and do further data analysis, but for this project I will stop here. So that's all you need to do in this project. Here we used Python for data processing and Tableau for data analysis. If you liked this video, please hit the thumbs up button. Also subscribe to the channel for more such interesting tutorials, and don't forget to press the bell icon to get notified next time I upload a video. Thanks for watching. This is Manish from RebellionRider.com.


