Crombie Tools
Analysis Workspace

Describes the layout of an analysis workspace created by crombie workspace.

A workspace is where all analysis-specific configurations are placed. You can create a workspace named Workspace as follows.

mkdir Workspace
cd Workspace
crombie workspace


A new workspace will be created from the CrombieTools/templates directory and have the following layout.

Workspace/
|-- docs
|   |-- Workspace.tex
|   |-- figs
|   `-- presentation.tex
|-- plotter
|   |-- CrombiePlotterConfig.sh
|   |-- MCConfig.txt
|   `-- cuts.py
`-- slimmer
    |-- CrombieSlimmingConfig.sh
    |-- FlatSkimmer.sh
    |-- JobScriptList.txt
    `-- runSlimmer.py


Each subdirectory has a distinct function, described below.

# Slimming

The first thing that will often be needed in an analysis is slimming files into flat trees. There are two essential steps that this layout assumes you need.

1. Changing the file format: The user can go from some large, inclusive ntuples to something small and flat. Crombie Tools currently supports two methods to do this in batch jobs. One is to submit jobs to LXBATCH to run on files on EOS. The other is to run on files interactively in a terminal. To allow the user to run this step less often, they are encouraged not to perform too much skimming of events in this step, as that can be performed in the next step.

2. Skimming events from trees: This can be done with a simple cut string, as well as using a Good Run list.

To understand how to do these steps, take a look inside the file slimmer/CrombieSlimmingConfig.sh. The variables all start with Crombie to ensure that the names do not overlap with any environment variables that may be set by other tools used by the user. The meaning of each variable is listed below.

# Environment Variables

These are the environment variables used in the slimming and skimming of an analysis.

- CrombieFilesPerJob: This specifies the number of files on EOS each LXBATCH job will read. Keep this constant between resubmissions, because it directly determines which files are run together. The other variables setting up the queue and number of cores won't change anything.
- CrombieQueue: Specifies which LXBATCH queue will be submitted to. Examples are 8nm, 1nh, 8nh, 1nd, 2nd, 1nw, 2nw, and 2nw4cores.
- CrombieNLocalProcs: The number of processors used for local slimming, skimming, plotting, etc. The default number uses all available processors, which is perhaps not always desired.
- CrombieFileBase: The name of each flat file you'll work with starts with a base followed by _, and that base is set here so that your LXBATCH jobs make the correct name. These namespaces are useful for knowing what each ntuple was used for previously.
- CrombieEosDir: This can be one of two things: a directory on EOS where you will look for all datasets, or a local .txt file which has a list of directories to look for datasets. The submission tool will figure out which of the two you set by checking whether a local file with that name exists.
- CrombieRegDir: crombie terminalslim does not necessarily run on EOS, as crombie submitlxbatch does. This variable sets what folder to look in for datasets. Note that this only takes a relative directory, not a list of directories.
- CrombieTempDir: A location to store the direct LXBATCH output. This is the directory that will be checked when the tool is trying to determine what files have and have not been successfully created.
- CrombieFullDir: This is simply the location of the hadded LXBATCH output. That is, all the files of the same dataset will be combined. This directory will also hold a list of the original locations of the datasets on EOS so that differences in dataset location can be detected for separate runs.
- CrombieSkimDir: This is the location of the flat trees run through a good runs skim as well as any other cuts added. This is not a necessary variable, but is used in the template of FlatSkimmer.sh.
- CrombieDirList: If left blank, all of the datasets in the CrombieEosDir or CrombieRegDir will be run on. Otherwise, this variable should name a local .txt file that has a list of datasets that you want to run on.
- CrombieSlimmerScript: This names the script that the LXBATCH job will run. Make sure that the script is executable (chmod +x), and that it takes two arguments: the input file name and the output file name. If no arguments are passed, make sure the script compiles everything that it needs (with LoadMacro('..+') for example).
- CrombieJobScriptList: This variable names a local .txt file that lists the relative paths of all files that should be copied to the LXBATCH node for the job to be completed. This will often include macros and headers needed. All of these files must be in the slimmer subdirectory. Full path names or using .. will not work.
- CrombieCheckerScript: Names a script that checks the output of each file run on in the LXBATCH job. The script should return a non-zero exit code if there is a problem with the output file. The script should return exit code 5 for fatal errors that will abort the job. The job will also abort for any non-zero exit code for the hadded output of the LXBATCH job. These errors will be reported in a local file, LxbatchFileChecks.log.
- CrombieGoodRuns: This variable is also only used by the template of FlatSkimmer.sh, so it's optional. It names the location of the good runs JSON for the skimmer to use.
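The behaviour described for CrombieEosDir and CrombieFilesPerJob can be illustrated with a minimal Python sketch. This is not the Crombie Tools implementation: the function names are hypothetical, the list file is assumed to hold one directory per line, and files are assumed to be grouped in sorted order.

```python
import os


def resolve_dataset_dirs(crombie_eos_dir):
    """Return the directories to search for datasets.

    If a local file with the given name exists, it is read as a list of
    directories (assumed: one per line). Otherwise the value itself is
    treated as a single directory.
    """
    if os.path.isfile(crombie_eos_dir):
        with open(crombie_eos_dir) as list_file:
            return [line.strip() for line in list_file if line.strip()]
    return [crombie_eos_dir]


def group_files(files, files_per_job):
    """Split a file list into consecutive chunks, one chunk per job.

    Assumed: files are grouped in sorted order.
    """
    ordered = sorted(files)
    return [ordered[i:i + files_per_job]
            for i in range(0, len(ordered), files_per_job)]
```

Under these assumptions, changing files_per_job between submissions changes which files share a job, so previously produced job outputs no longer correspond to the same inputs. That is why the value should stay constant between resubmissions.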

# Generating flat trees for output

This tool is now deprecated; see the documentation of the new tool.

There is a tool for generating flat tree classes for the user. The variables that you want to include in a flat tree can be specified in OutTree.txt or whatever you rename it to. The format of the configuration file is <branchName>/<type>=<default>. <branchName> should be easy to understand. Valid entries for <type> are the following:

- F: float
- I: int
- i: unsigned int
- L: long
- l: unsigned long
- O: bool

Any other types will be assumed to be TObjects. The header file for the listed TObject will be included automatically. This feature is not tested extensively since I don't use it.

You can also preface a type with V for (pointers of) vectors of these types. After writing a tree configuration file with name OutTree.txt, just run

crombie oldmaketree OutTree


This makes a class that contains your tree. This is done automatically for you in the default runSlimmer.py template. You can access each branch of the tree via a public member with the same name as <branchName> and fill the whole tree with the function Fill() at the end of each event. If you do not set a value for a particular event, the branch will be filled with <default>. You can then write the tree to a file via WriteToFile(*TFile,"<WhatYouWantToCallTree>"). There are also other overloaded write and creation functions using a file name you specify with the initializer. See slimmer.cc for an example of how to write a flat tree using this class.
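To make the <branchName>/<type>=<default> format concrete, here is a small Python sketch of a line parser. It is only an illustration of the format as described in this section, not the code used by crombie oldmaketree. The function name parse_branch, the branch names in the usage note, and the exact C++ spelling chosen for vector types are assumptions.

```python
# Type codes listed in this section, mapped to C++ types.
TYPE_CODES = {
    "F": "float",
    "I": "int",
    "i": "unsigned int",
    "L": "long",
    "l": "unsigned long",
    "O": "bool",
}


def parse_branch(line):
    """Parse '<branchName>/<type>=<default>' into (name, cpp_type, default)."""
    declaration, _, default = line.strip().partition("=")
    name, _, type_code = declaration.partition("/")
    if type_code in TYPE_CODES:
        cpp_type = TYPE_CODES[type_code]
    elif type_code[:1] == "V" and type_code[1:] in TYPE_CODES:
        # 'V' prefix: (pointer to) a vector of the basic type (assumed spelling).
        cpp_type = "std::vector<%s>*" % TYPE_CODES[type_code[1:]]
    else:
        # Any other type is assumed to be the name of a TObject class.
        cpp_type = type_code
    return name, cpp_type, default
```

For example, a hypothetical line met/F=-5 would be read as a float branch named met with default -5, and jetPt/VF= as a vector-of-float branch with no default given.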

# Skimming

The environment variables used for skimming the flat trees afterwards are actually optional, if you edit the file slimmer/FlatSkimmer.sh to accommodate that. The user is encouraged to edit FlatSkimmer.sh, which makes use of a flexible tool, crombie skim. Note, crombie skim is not included in the command line references because it is not recommended that this command is used interactively. It is much more efficient and stable to create a script like FlatSkimmer.sh. crombie skim takes files from an input directory, skims them, and places them in a separate directory. Here is the help message to help you understand how to customize this.

usage: crombie skim [-h] [--numproc NUM] [--indir DIR] [--outdir DIR]
                    [--json FILE] [--cut CUT] [--tree NAME]
                    [--copy [NAMES [NAMES ...]]] [--run EXPR] [--lumi EXPR]
                    [--freq NUM] [--filters [FILE [FILE ...]]] [--duplicate]

Slims the contents of one directory into another one

optional arguments:
  -h, --help            show this help message and exit
  --numproc NUM, -n NUM
                        Number of processes that FlatSkimmer will spawn.
  --indir DIR, -i DIR   Directory that contains input files to be slimmed.
  --outdir DIR, -o DIR  Directory where slimmed stuff will be placed.
  --json FILE, -j FILE  Good runs json file location to be used.
  --cut CUT, -c CUT     Cut used in slimming.
  --tree NAME, -t NAME  Name of tree that will be slimmed.
  --copy [NAMES [NAMES ...]]
                        List other object names to copy into the slimmed file.
  --run EXPR, -r EXPR   Set the expression for Run Number.
  --lumi EXPR, -l EXPR  Set the expression for Lumi Number.
  --freq NUM, -f NUM    Set the reporting frequency.
  --filters [FILE [FILE ...]], -e [FILE [FILE ...]]
                        Set the filter files.
  --duplicate, -d       Turn on duplicate checking.


After running FlatSkimmer.sh, you should have your small ntuples ready to work with.
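As an illustration of how the options in the help message fit together, the following Python sketch assembles a crombie skim argument list from the workspace variables. It only uses flags shown in the help message. The function name, the choice of CrombieFullDir as the input directory, and the cut and tree name in the usage note are assumptions rather than the contents of the FlatSkimmer.sh template.

```python
def build_skim_command(env, cut, tree, numproc=1):
    """Assemble an argument list for 'crombie skim'.

    env is a mapping of environment variables, for example os.environ.
    Assumed: CrombieFullDir holds the input and CrombieSkimDir the output.
    """
    command = [
        "crombie", "skim",
        "--numproc", str(numproc),
        "--indir", env["CrombieFullDir"],
        "--outdir", env["CrombieSkimDir"],
        "--cut", cut,
        "--tree", tree,
    ]
    # The good runs JSON is optional, so only pass it when it is set.
    good_runs = env.get("CrombieGoodRuns")
    if good_runs:
        command += ["--json", good_runs]
    return command
```

A script could then pass the result to subprocess.check_call, for example with a hypothetical cut "met > 200" and tree name "events".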

# Plotting

The next subdirectory of a workspace is the plotting directory. This comes with its own list of environment variables.

- CrombieMCConfig: This names the file that will be read to set the backgrounds for plots, limit trees, datacards, etc. See the MC Configuration for details on how to set up this background configuration.
- CrombieSignalConfig: This variable names a file like CrombieMCConfig, but files listed in the named location are assumed to be signal files, not background files.
- CrombieExcept_*: The * part of the variable name names a region, and the variable points to a file that designates replacements to the background configuration for the named region. See the instructions for Adjustment Configuration below to set up this file correctly.
- CrombieLuminosity: This variable just gives the luminosity used to make plots and limit workspaces.
- CrombieInFilesDir: This variable names the directory containing the ntuple files that are being used for plotting and tree making. It should usually match the latest value you had for CrombieSkimDir.
- CrombieOutPlotDir: This variable names the directory where an automatically configured PlotStack object will place all of the plots.
- CrombieOutLimitTreeDir: This variable names the directory where an automatically configured LimitTreeMaker object will place all of the tree files.
- CrombieCutsFile: This variable gives the name of the python file to be loaded as the function CrombieTools.LoadConfig.cuts when the user imports CrombieTools.LoadConfig in a working directory with this environment variable set.
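The CrombieExcept_* naming convention can be read as a mapping from region name to adjustment file. The sketch below shows one way to collect that mapping in Python. The function name and the region and file names in the usage note are hypothetical, and this is not taken from the Crombie Tools source.

```python
PREFIX = "CrombieExcept_"


def region_adjustments(env):
    """Map each region name to its adjustment configuration file.

    env is a mapping of environment variables, for example os.environ.
    The region is whatever follows the CrombieExcept_ prefix.
    """
    return {
        name[len(PREFIX):]: path
        for name, path in env.items()
        if name.startswith(PREFIX) and path
    }
```

With a hypothetical CrombieExcept_signal=SignalAdjust.txt set in the environment, region_adjustments(os.environ) would contain the pair "signal": "SignalAdjust.txt".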

# Formatting MC Configuration Files

Each analysis will probably make use of multiple MC samples. You can keep track of them all with one simple MC Config.

## Base Configuration

You will generally have one main configuration file with most of your background samples listed. Signal samples should be kept in a separate configuration file, since these will be marked as signal or background when read. Each sample should be contained in a single .root file. The MC Config will keep track of these files, one row at a time. The order of the elements should be this:

<LimitTreeName>    <FileName>    <CrossSection>    <LegendEntry>    <FillColorOrLineStyle>


The elements are space delimited.

- Limit Tree Name: This is the base of the tree that will be made by LimitTreeMaker for this file. The name should be unique for each file if using LimitTreeMaker. For other analyses, this can instead be used only to differentiate signal and background. In this case, putting . in the config file will copy the previous line.
- File Name: This is the name of the file for the given sample. The file name does not need to be absolute, as the input directory is set in FileConfigReader::SetInDirectory(), usually by reading the environment configuration.
- Cross Section: This should be the cross section of the sample in pb.
- Legend Entry: This is the legend entry that will be made in all of the stack plots using this config for the given file. If you want to have spaces in your legend entry, place _ instead (since the elements of the config are space delimited). These are all replaced with spaces by FileConfigReader::ReadMCConfig(). Repeating a legend entry on adjacent lines will cause multiple files to merge into the same stack element. A shortcut for reusing the legend entry of the previous line is to just put . as the Legend Entry.
- Fill Color or Line Style: For background MC, this specifies the fill color, using the Color_t enums from ROOT. If you wish to give a custom RGB color, just make this entry rgb and follow that with the red, blue, and green components, space delimited, each out of 255. If the legend entry of this line matches the entry in the previous line, the color is ignored (but must still be in the config). Again, placing a . in this case is a useful shortcut. For signal samples, this entry should give the line style you wish to use for the sample.
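The row format and its shortcuts can be sketched with a short Python reader. This is an illustration of the rules described above, not FileConfigReader::ReadMCConfig() itself. The function name and all sample names, file names, and cross sections in the usage note are made up, and the three numbers after rgb are stored in the order given without interpreting them.

```python
def read_mc_config(lines):
    """Parse MC Config rows into dictionaries, resolving '.' shortcuts."""
    entries = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # ignore blank lines
        tree, file_name, cross_section, legend = fields[:4]
        style = fields[4:]  # one entry, or 'rgb' plus three components
        previous = entries[-1] if entries else {}
        if tree == ".":
            tree = previous["tree"]
        if legend == ".":
            legend = previous["legend"]
        else:
            legend = legend.replace("_", " ")  # underscores stand for spaces
        if style == ["."]:
            style = previous["style"]
        elif style[0] == "rgb":
            style = tuple(int(part) for part in style[1:4])
        else:
            style = style[0]
        entries.append({
            "tree": tree,
            "file": file_name,
            "cross_section": float(cross_section),  # in pb
            "legend": legend,
            "style": style,
        })
    return entries
```

For instance, a hypothetical row "zjets zjets_0.root 12.5 Z_+_jets 600" followed by ". zjets_1.root 3.2 . ." yields two entries that share the tree name zjets, the legend "Z + jets", and the style 600.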

To avoid having duplicate entries in multiple configurations, there is an easy way to switch out MC samples for different ones, while keeping the rest of the samples the same. If a line starts with the keyword skip instead of a tree name and then lists a file, the MCReader will erase the MCFileInfo for that file. A line like this simply contains:

skip    <FileName>


A configuration file with lines like this can also contain lines like those in the base configuration. This makes it easy to swap out files. After reading the base config, just read the adjusting configuration before making limit trees or plotting.
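The effect of reading a base configuration followed by an adjusting configuration can be sketched as follows. This is a simplified Python illustration, not the MCReader code: samples are keyed by file name, the function name is hypothetical, and the file names in the usage note are invented.

```python
def apply_config(samples, lines):
    """Apply config lines to a dict of samples keyed by file name.

    A 'skip <FileName>' line removes that file. Any other non-empty
    line adds (or replaces) the sample for the file it names.
    """
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # ignore blank lines
        if fields[0] == "skip":
            samples.pop(fields[1], None)
        else:
            samples[fields[1]] = fields
    return samples
```

Reading a base config that lists old.root and then an adjusting config containing "skip old.root" plus a row for new.root leaves only new.root alongside the untouched samples.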

## Merging MC Samples

MC samples that are created using different generators or hadronizers, etc. can be easily merged together without changing the configured cross section. The different samples are weighted in such a way as to minimize the total statistical uncertainty.

To merge multiple samples, start by placing the following on a line of its own.

INGROUP


Separate each set of samples with this delimiter. After the last set of samples to be merged, place the following line.

ENDGROUP


For example, if you want to merge three types of samples in your plots, your MC Config would look like this.

INGROUP
example     type0_file0.root    0.5   LegendEntry    600
.           type0_file1.root    0.5   .              .
INGROUP
.           type1_file0.root    0.2   .              .
.           type1_file1.root    0.6   .              .
.           type1_file2.root    0.2   .              .
INGROUP
.           type2_file0.root    1.0   .              .
ENDGROUP


This will merge three different samples of a process with a cross section of 1.0 in such a way that their total plotted cross section is 1.0 and their statistical uncertainty is minimized.
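The grouping syntax can be sketched in Python as below. The read_groups function only reflects the INGROUP and ENDGROUP rules stated in this section. The merge_weights function is an assumption: it shows standard inverse-variance weighting, where each group is weighted in proportion to its effective number of events. The weighting actually used by Crombie Tools is not specified here and may differ.

```python
def read_groups(lines):
    """Split INGROUP/ENDGROUP blocks into lists of (file_name, cross_section)."""
    groups = []
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # ignore blank lines
        if fields[0] == "INGROUP":
            current = []  # start a new set of samples
            groups.append(current)
        elif fields[0] == "ENDGROUP":
            current = None  # merging block is finished
        elif current is not None:
            current.append((fields[1], float(fields[2])))
    return groups


def merge_weights(effective_events):
    """Inverse-variance weights for independent estimates of one process.

    Assumed: the variance of each group's estimate scales as 1/N, so the
    weights are proportional to N and sum to 1.
    """
    total = float(sum(effective_events))
    return [count / total for count in effective_events]
```

Applied to the example config above, read_groups returns three groups whose cross sections each sum to 1.0, which is what allows the merged result to keep a total plotted cross section of 1.0.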

# Documentation

Todo:
Document the documentation subdirectory

# Miscellaneous

Other directories can of course be added by hand. There are certain ways to still source the old configuration files if you need them, and all of the command line and python tools are still available. Just be aware that if you change the configuration in a separate directory, the changes will be reflected in your miscellaneous directory. An analysis should be as tightly coupled as possible.