Planning and Getting Started
We began on Monday with introductions and a tour of campus. We reviewed some concepts in machine learning and big data and then met with our mentors to start gathering background for our projects. Dr. Zarella has two ideas for my project, since I already have some background from working in the program last year. The first would be a continuation of my previous work converting his programs into Python. The other would be to reproduce some work from researchers at UNC and test new images against a program created by that group to evaluate its performance.
I spent Tuesday reading the UNC researchers' paper, "Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype," and reviewing the code they produced for it. I also began reviewing the code produced during the REThink program last year to ensure it still worked with updated dependencies and to refresh my memory of how it was structured. I was able to get it to run properly and began to assess some of the performance bottlenecks in the application.
On Wednesday, I applied for access to the data sets that the UNC paper used to train its program and predict tumor pathologies. One of these was provided to me by an automated system. The other requires a human reviewer, and a response should take several days. I am not optimistic about this set. We met at Dr. Zarella's lab to discuss options and background on the work so that I could see what projects he and his research students are currently undertaking.
By Thursday I had decided to evaluate the UNC program with new data sets as my project. I wrote my project proposal and discussed it with Dr. Zarella. I spent the rest of the day trying to pre-process the image data set so that it could be run through the machine learning program. The program used in the UNC paper requires several files to be in specific places and the data to be organized in a particular way. There are several Python scripts that prepare and pre-process this data, but they require significant storage space and take some time to run. I have prepared virtual machines to run the programs, both to isolate any dependencies and to ensure I have a clean development environment, but I had to free up space on my computer, and several of the virtual machines I created as experiments failed.
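Since the pre-processing depends on files being in the right places and on having enough free disk space, a small pre-flight check can catch both problems before a long run starts. The sketch below is only an illustration: the function name, the paths, and the free-space threshold are hypothetical and are not taken from the UNC code.

```python
import shutil
from pathlib import Path


def preflight(required_paths, work_dir, min_free_gb):
    """Return a list of problems found before starting a long pre-processing run.

    required_paths: files or folders the scripts expect to exist (hypothetical).
    work_dir: existing folder where output will be written.
    min_free_gb: free space wanted in work_dir, in gigabytes (an assumed threshold).
    """
    problems = []

    # Check that every expected file or folder is where the scripts will look.
    for p in required_paths:
        if not Path(p).exists():
            problems.append(f"missing: {p}")

    # Check free space on the drive that holds the working folder.
    free_gb = shutil.disk_usage(work_dir).free / 1e9
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.1f} GB free in {work_dir}, need {min_free_gb}"
        )

    return problems
```

Calling something like `preflight(["data/images", "data/labels.csv"], ".", 50)` (placeholder paths and size) returns an empty list when everything is in place, so the pre-processing scripts are only started once the list comes back empty.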
On Friday we began with a presentation and then gave project status updates. We met with the program advisers to discuss our research plans and goals and then went out for a group lunch. In the afternoon we continued working on our projects, and I was finally able to get the pre-processing scripts to run to completion on my raw data sets.