University of Rhode Island
Department of Biological Sciences
College of the Environment and Life Sciences
BIO 439X/594-14
Big Data Analysis
Spring 2017

Instructor: Dr. Rachel Schwartz

Office Location: CBLS 377

Telephone: 4-5404

Email: rsschwartz@uri.edu
Please note that I will do my best to respond promptly between 9am and 5pm on weekdays.

Office Hours: Friday 12-2pm CBLS 252
Please feel free to come by without an appointment during this time. To meet with me at other times, please set up a meeting by email.

Class Days / Time: T/Th 1:30-2:45pm

Classroom: Coastal Institute 117A

Prerequisites: Graduate standing (BIO594) or junior standing (BIO439X) or instructor permission.

Course Description: This course is intended to help you learn how to analyze large datasets correctly and efficiently. We will discuss methods for analysis of big data, and how to get research done more efficiently using basic scientific computing skills. This course will consist of limited lecture time and extensive hands-on time. We will cover data management, statistical methods, task automation, and how to make data analysis clear and reproducible. No prior programming experience is required. Although most of the data will be biological or environmental, the material learned will be fully translatable to other fields.

Course Goals:

At the end of this course students should be able to...

Student Learning Outcomes:

At the end of this course students enrolled in both courses should be able to...

Students enrolled in BIO594 should also be able to...

Required texts

Bioinformatics Data Skills. Vince Buffalo. 2015. O'Reilly Media.

R for Data Science. Hadley Wickham and Garrett Grolemund. 2016. free online at http://r4ds.had.co.nz/

Data Science from Scratch. Joel Grus. 2015. O’Reilly Media. (available free online via the library and Safari Books)

The Practice of Reproducible Research. Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) 2017. UC Press. Free online at https://www.practicereproducibleresearch.org/

A Whirlwind Tour of Python. Jake VanderPlas. 2016. Available free at https://github.com/jakevdp/WhirlwindTourOfPython

Python Data Science Handbook. Jake VanderPlas. available free at https://github.com/jakevdp/PythonDataScienceHandbook

Interesting supplementary reads

Data Science at the Command Line. Jeroen Janssens. 2014. O'Reilly Media.

Python Data Science Handbook. Jake VanderPlas. 2015. O’Reilly Media.

Doing Data Science. Rachel Schutt and Cathy O'Neil. 2013. O'Reilly Media.

Big Data and Social Science: A Practical Guide to Methods and Tools. Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, and Julia Lane (Eds.) 2016. CRC Press.

On Being a Data Skeptic. Cathy O'Neil. 2013. O'Reilly Media.

Other equipment / material requirements

Mac, Linux, or Windows laptop (not a tablet, Chromebook, etc.) with administrative privileges. Please contact the instructor if you do not have a laptop and purchasing one would be a financial difficulty.

Meeting Schedule

Note: Updated April 3, 2017 - Still subject to change

Week | Topic | Process | Readings | Assignment
1 (1/23) | Introduction / Software installation | | Kitzes et al. P1 |
 | Automating repetitive tasks | Intro to the command line | Buffalo Ch. 3 |
2 (1/30) | Automating repetitive tasks | Shell scripts | Buffalo Ch. 7 | Commenting a shell script
3 (2/6) | Big data and answerable questions | Discussion of your research | Buffalo Ch. 1 |
 | High performance computing | Intro to the cluster | Buffalo Ch. 4 |
4 (2/13) | Sharing, collaboration, and research transparency | Git | Buffalo Ch. 5 | Share your git repo
5 (2/20) | Coding for data analysis | Intro to R and RStudio | Buffalo Ch. 8 |
 | Data organization and management | R: functions | Buffalo Ch. 2, Kitzes P1 |
 | Data visualization for exploration | R: ggplot2 | Wickham Ch. 2, 3, 28 |
6 (2/27) | Good data practice | R Markdown | Wickham Ch. 27, 29, 30 | Exploratory analysis
 | Analyzing data by category | R: plyr/dplyr | Wickham Ch. 5 |
7 (3/6) | Tidy data and relational data | R: dplyr, tidyr | Wickham Ch. 12, 13 |
 | Testing and scripting | R - shell | Buffalo Ch. 8 |
8 (3/20) | Recap | | |
 | Appropriate analyses and statistics | | Foster et al. |
9 (3/27) | Automating repetitive tasks | Python: intro | Grus Ch. 2 / VanderPlas: Whirlwind |
 | Pattern counting | Python: basics | VanderPlas Ch. 1 | Pattern counting
10 (4/3) | Data structures and complex data | Python: pandas | VanderPlas Ch. 2-3 |
11 (4/10) | Packaging complex data | Python: pandas joins | |
 | Making sure it's right | Python: tests | | Complete data analysis
12 (4/17) | Review and work on projects | | |
 | Command line | Python: scripts and tests | |
13 (4/24) | Visualization | Python: matplotlib | |
 | Wrap up | | |

Assignments and Grading Policy

There will be five assignments during the course of the semester. Each assignment will focus on the skill learned in class during the week and build on skills and concepts learned in previous weeks. These assignments will be initiated in class and are due at the end of the following week.

For each assignment you will earn points. The more points you have earned, the more you have learned. Your grade will be based on the total number of points you accumulate during the semester. Assignments will be worth a total of 500 points. Students in BIO594 will also complete a cumulative project involving two of the following: (1) a detailed written proposal on research involving big data, (2) analysis of research data, (3) a ~1000 word paper on your research using data, or (4) a poster of your research with data. The project will be worth 500 points.

Grading for BIO594
A 900-1000
A- 880-899
B+ 860-879
B 800-859
B- 780-799
C+ 760-779
C 700-759
C- 680-699
D+ 660-679
D 600-659
F <600

Grading for BIO439X
A 450-500
A- 440-449
B+ 430-439
B 400-429
B- 390-399
C+ 380-389
C 350-379
C- 340-349
D+ 330-339
D 300-329
F <300
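
Since the course teaches Python, the point-to-letter lookup in the two tables above can be sketched as a short function. This is only an illustration, not an official grade calculator: the names letter_grade, BIO594_CUTOFFS, and BIO439X_CUTOFFS are hypothetical, and the cutoffs are copied from the tables above.

```python
# Illustrative sketch only; names are hypothetical, cutoffs copied from the grading tables.
# Each entry is (minimum points, letter grade), sorted from highest to lowest.
BIO594_CUTOFFS = [
    (900, "A"), (880, "A-"), (860, "B+"), (800, "B"), (780, "B-"),
    (760, "C+"), (700, "C"), (680, "C-"), (660, "D+"), (600, "D"),
]
BIO439X_CUTOFFS = [
    (450, "A"), (440, "A-"), (430, "B+"), (400, "B"), (390, "B-"),
    (380, "C+"), (350, "C"), (340, "C-"), (330, "D+"), (300, "D"),
]


def letter_grade(points, cutoffs):
    """Return the letter grade for a point total.

    cutoffs must be (minimum, letter) pairs sorted from highest to lowest;
    the first minimum that the point total meets determines the grade.
    """
    for minimum, letter in cutoffs:
        if points >= minimum:
            return letter
    return "F"  # below the lowest cutoff


if __name__ == "__main__":
    print(letter_grade(455, BIO439X_CUTOFFS))
    print(letter_grade(879, BIO594_CUTOFFS))
```

Keeping the cutoffs in a list separate from the lookup function means the same function serves both course numbers.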

Instructor Policies

Any student with a documented disability should contact me as soon as possible so that we may arrange reasonable accommodations. As part of this process, please be in touch with the Disability Services for Students office at 302 Memorial Union, phone 401-874-2098.

Students are expected to treat faculty and fellow classmates with dignity and respect.

Students are responsible for being familiar with and adhering to the published “Student Code of Conduct” which can be accessed in the University Student Handbook (http://web.uri.edu/studentconduct/student-handbook/).

Students are expected to be honest in all academic work. A student’s name on any written work, quiz or exam shall be regarded as assurance that the work is the result of the student’s own independent thought and study. Work should be stated in the student’s own words, properly attributed to its source. Students have an obligation to know how to quote, paraphrase, summarize, cite and reference the work of others with integrity. The following are examples of academic dishonesty.

Classroom Protocol

Bring your laptop to class every day.