On this page, I answer FAQs that many scientists have asked me. If you have a question that isn’t answered here, please feel free to email me or send me a Teams message!
This document is a work in progress! Last edited 2023-11-27.
Please contact me by email (quentin dot read at usda dot gov) or Microsoft Teams message. I will do my best to get back to you as soon as possible!
I can provide you a better answer if I know what the goals of your research are, and if I know what your data look like. I would greatly appreciate it if your help request would include at least a few sentences describing the research, including the goal of the research and what specific research questions you are exploring/hypotheses you are testing. Also, if you can provide at least a sample of your raw data so that I can see what format it is in and what kind of variables we will be working with, that’s very helpful too. If you have anything like a field map or spreadsheet of treatment assignments that helps clarify the experimental design, that’s also helpful for me to look at.
With all of that said, don’t worry too much about providing every single possible piece of information. But if I have the info I need to help you ahead of time, it can make our consultation meetings much more efficient and productive!
It makes it a lot more efficient for both of us if you can provide data in a format that is ready to be imported into statistical software like R or SAS. That maximizes the amount of time I can spend helping you with data analysis, visualization, modeling, and storytelling. You know your data better than I do, so if you are the one who takes the lead in cleaning and formatting the data, there is less potential for error.
I would prefer to have data in a “tidy” format, which means:
See this excellent guide to sharing data with a statistician for more details on the best format for sharing data.
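As a rough illustration of what reshaping into a tidy layout can look like, here is a minimal sketch in Python using only the standard library. It assumes the common tidy-data convention of one row per observation and one column per variable. The plot names, treatments, and yield values are made up for the example, and `wide_to_tidy` is a hypothetical helper, not part of any package. In R, the `pivot_longer()` function in tidyr does the same kind of reshaping.

```python
import csv
import io

# Hypothetical "wide" data: one row per plot, one yield column per year.
WIDE_CSV = """plot,treatment,yield_2021,yield_2022
A1,control,4.1,4.4
A2,fertilized,5.0,5.6
"""


def wide_to_tidy(wide_text):
    """Return one dict per observation (plot x year) from wide CSV text."""
    tidy_rows = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        for column, value in row.items():
            # Only the yield_<year> columns hold measurements to unpack.
            if column.startswith("yield_"):
                tidy_rows.append({
                    "plot": row["plot"],
                    "treatment": row["treatment"],
                    "year": int(column.split("_", 1)[1]),
                    "yield": float(value),
                })
    return tidy_rows


tidy = wide_to_tidy(WIDE_CSV)
for record in tidy:
    print(record)
```

Each printed record is a single measurement with its own `year` value, instead of the year being hidden in a column name.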
It runs the gamut from a quick email or 10-minute conversation to a collaboration that can last for months or years. I can answer questions you have, help point you toward resources that can help you learn about the stats or models you need, or review code or text you’ve written to make sure it’s correct. If needed, I can do some analysis for you, or even take the lead on the entire data manipulation, analysis, and presentation workflow. It really depends on your needs. Every project is different! But no matter what, it is a “co-creation” process where we will work together to use your data to tell the story you want to tell.
I am a big proponent of Bayesian methods. They are more flexible and allow us to fit models that classical statistical approaches just can’t handle. Also, philosophically it’s a better way to approach science: classical statistics tries to reject or not reject a null hypothesis, which gives the false impression that the world is black and white and there are “yes or no” answers to our hypotheses about the world. Bayesian statistics is more about estimating the size of the effects and being honest about the level of uncertainty we have for any claim we make about the world. Of course, I know many people haven’t been trained in that approach, so I am happy to work with you to learn more about it. Even if you don’t become a Bayesian, it’s important to at least be familiar with the terminology and the ideas behind it because you will start to see it more and more in the literature as time goes on.
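To make the "estimate the effect and its uncertainty" idea concrete, here is a small sketch in Python using only the standard library. It is a toy conjugate normal model, not the kind of model you would fit in practice. The data are made up, the observation standard deviation is assumed to be known so the math stays simple, and the prior is an assumption chosen for illustration. `posterior_for_mean` is a hypothetical helper name.

```python
from statistics import NormalDist, mean

# Made-up paired differences (treatment minus control), for illustration only.
differences = [0.8, 1.1, -0.2, 0.9, 0.5, 1.4, 0.3, 0.7]
known_sd = 0.6                          # assumed known observation SD
prior = NormalDist(mu=0.0, sigma=1.0)   # assumed prior on the mean effect


def posterior_for_mean(data, sd, prior):
    """Conjugate normal update for a mean when the observation SD is known."""
    prior_precision = 1 / prior.stdev ** 2
    data_precision = len(data) / sd ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_precision * prior.mean + data_precision * mean(data)
    )
    return NormalDist(post_mean, post_var ** 0.5)


posterior = posterior_for_mean(differences, known_sd, prior)
lower = posterior.inv_cdf(0.025)        # bounds of a 95% credible interval
upper = posterior.inv_cdf(0.975)
prob_positive = 1 - posterior.cdf(0.0)  # posterior probability the effect is > 0

print(f"posterior mean: {posterior.mean:.2f}")
print(f"95% credible interval: ({lower:.2f}, {upper:.2f})")
print(f"P(effect > 0): {prob_positive:.3f}")
```

The output is an effect size with an interval and a probability, not a yes-or-no verdict on a null hypothesis. Real analyses would estimate the standard deviation too, which is where tools like Stan and brms come in.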
Whether we’re working with Bayesian or classical models, I really like GLMMs (generalized linear mixed models). They are a very flexible kind of model that allows us to work with data with all kinds of non-independence in space and time, and all kinds of distributions.
Bayesian stats and GLMMs are best for “small data” or medium-sized data. When it comes to big data, we have to move to machine learning approaches. I am not an expert in those, but I am excited to learn with you!
I primarily use R. If I do an analysis “from scratch” for a scientist, I usually do the analysis in R and write it up as an RMarkdown notebook. That’s a document that includes code, output, figures, and explanatory text all in one place. I find that this is the best way to share my work with scientists.

What R packages do I usually use?

I do most data manipulation using tidyverse but also use data.table for larger datasets. I use the lme4 or nlme packages for classical statistical analyses, and Stan software coupled with the R packages brms and tidybayes for Bayesian analyses. emmeans and easystats are great packages for supporting all kinds of analyses. For geospatial data stuff, I use the sf package in R as well as occasionally using GDAL and GEOS on the command line.
I am also somewhat experienced with SAS and capable of helping you with your SAS code, as well as Python to a lesser extent. I can also help you with your JMP analysis. But I encourage ARS scientists to use R or at least to familiarize yourselves with it.
Yes, I have some experience using SciNet and other high-performance computing clusters, and I can probably provide you some help. But for more involved requests, I’d recommend getting in touch with folks from GBRU or asking your question on the SciNet forums.
I have a lot of ongoing commitments to help out scientists at any given time. I work on a first-come, first-served basis. But I do want to make regular progress on all the projects. So I cycle through all my currently active projects and work on each one for a chunk of time. Currently I’m able to work on each project roughly every 1-2 weeks. Ideally, I would make enough progress each time to send you an update. But typically I will only be able to devote a small percentage of my time to a specific project in any one week. Feel free to email me at any time with questions or clarifications; again, I’ll address those on a first-come, first-served basis as they come in.
Of course, I am willing to make exceptions if there is a rush deadline. The sooner you can let me know, the better, so that I can plan accordingly.
I do not have a formal publication quota, but I am informally evaluated in part based on the publications and presentations I co-author. Of course, my contribution will vary a lot from project to project. Please consider adding me as a co-author on any paper or other product where I’ve made a meaningful contribution to the analysis, presentation, and/or writing. This is a good idea because it makes me formally accountable for the analysis I did or helped you do. If I am a co-author, I promise to hold up my end of the bargain and write any sections for which I am responsible, including creating figures and tables. Like any good co-author, I will review and give comments on the entire text of the manuscript. I’ll pay special attention to making sure statements in the abstract and discussion are supported by the analysis results. But if my contribution to your project is just a quick consultation or question-and-answer session, co-authorship is not necessary. An informal acknowledgment would be great!
Yes! I am passionate about promoting open and reproducible science in ARS. It’s especially important now that the White House has mandated we make all our data publicly available. That should also include the code that turns raw data into a final product with the results of an analysis. I encourage the use of GitHub. I will help you create a private GitHub repository where we can share code and collaborate on our project. When it is time to publish the manuscript, I will help you archive the code from the repository as well as the raw data on Ag Data Commons, the USDA’s own data repository. This will ensure that the code and data we produce at USDA provide the biggest possible benefit to society. Soon, it will be a requirement to file a 115 for Ag Data Commons entries; this will help us get credit for the additional work that it takes to make our science open and reproducible.
I am officially responsible for reviewing all 5-year CRIS project plans for the Southeast Area. The program analysts send me the preplans for each review cycle and I provide comments and feedback, primarily focusing on the experimental design, proposed statistical analyses, and data analysis/management parts of the plans. But if you want to get a head start on the process, I can help at an earlier stage if you send me questions about specific elements of your plan such as experimental design or power calculations.
Incidentally, it isn’t necessary to list me as a collaborator on your preplan. I am always available to provide statistical support to SEA scientists, whether or not my name appears on your plan.
Learning stats is a journey and a process. You can’t learn it overnight. However, I would recommend starting at my SEAStats training page for a gentle introduction to both the statistical models and the tools in R you will need to work with them. On that page I also have links to other helpful tutorials and learning resources. Also, check out the free online training page on SciNet that my area statistician colleagues Sara and Kathy put together with tons of resources!
If you would like me to teach a workshop on a topic related to statistics, data science, or statistical programming, I am available for either in-person workshops at SEA locations or virtually via Teams/Zoom.
There are three ways I can do a workshop: teach a lesson/short presentation from my lesson page, teach from someone else’s tutorial or learning resource, or do a “bespoke” lesson on a topic of your choice.
I can teach one or more of the lessons that are already on the SEAStats training page. Here is a rundown of what’s currently available there. For the “multi-part” lessons, we can do all parts or only a subset depending on how much time is available. You can also see a list of talks and presentations on the SEAStats page.
| Lesson | Description | Length |
|---|---|---|
| R Boot Camp | The basics of R programming and working with dataframes | 2 lessons, 90 minutes each |
| Mixed Models in R | Linear mixed models in R, including simple GLMMs and emmeans | 4 lessons, 90 minutes each |
| ggplot2 Basics | A brief introduction to the ggplot2 plotting package in R | 1 lesson, 90 minutes |
| Bayesian Mixed Models with brms | Introduction to Bayesian stats, with a mixed model example in the brms R package | 1 lesson, 3 hours |
| R for SAS Users | Intro to mixed models in R, assuming a SAS background | 2 lessons, 90 minutes each |
There are lots of great resources available for learning about stats, data science, and scientific programming. I would be happy to lead a workshop where we go through an existing tutorial together. If there is enough interest, we could even work through an entire book in a series of workshops. Here are a few examples, but feel free to suggest one of your own.
If you are interested in a topic that you cannot find a good learning resource for, I can probably help you find some good resources and lead a workshop where we go through them together. I am also open to developing new lessons if you give me enough advance notice!