SQL data cleaning method for Google Data Analytics Case Study 1 - Bike share
Case study csv cleaning and joining process
So, when I was working on my case study I felt overwhelmed and had a difficult time cleaning the data. As I was working thru the data, I had no idea really how to combine the csv files and clean them in a reasonable time. I did this process in a very difficult way. I opened each csv file and individually cleaned the data in Excel. At the time, I could think of nothing else. It was a time consuming process, and a waste of time. Now that I am done with the project, I have found a much more time efficient way to complete this part of the project.
This part of the process can all be done in SQL. SQlite Browser is the tool to use as it makes importing csv files and creating a table from them easy. It uses standard SQL code, and I will show you the code that I used to quickly get rid of rows with empty cells and to pull a random sample based on your sample calculator output.
1. Install SQlite Browser if you haven't already
2. Place all raw csv files into one folder for easy access
3. Create New Database in SQlite
4. File-Import-Table from csv file
5. Name the new table and make sure the Column names in first line is check marked
Click ok to create.
6. Click Execute SQL tab and type in the code that is in the image. Make sure that you type in your table name that you created in From
The Is NOT Null eliminates any Null(empty) values in each column. Yes you have to do this for each column. I did not find an easier way to do this. Please leave a comment if you know of an easier way to eliminate Null values!
7. Now that we have the Null's eliminated, we can pull a sample from the database table. I am pulling a sample so that I can have a table small enough to work the rest of the way in Excel and R. Go to Sample Size Calculator and input your values. Here is mine.
8. Use a query to return the sample size.
Add:
Order by random()
Limit 1953; -the sample size number from the calculator
Execute the query
9. Almost done. Click Export to csv(3 icons to the right of the execute icon)
Save csv in a folder. Now you have the option to do further work in Excel or R. If you use Excel, open the newly created csv file and do a save as Excel Workbook in a new name. Now you are good to complete the rest of the project. Don't forget to go thru the data in Excel to make sure there are no typos and such before doing any work.
I believe that it is easier to create the new calculated columns in Excel, rather than trying to create these in SQL. I did not have much luck doing so, still learning SQL myself. I just wish I started the project this way rather than the hard way I cleaned the data. Good luck!
Comments
Post a Comment