Tutorial for newcomers
setup of framework
The central package of the analysis framework is SFrame.
Documentation wiki: http://sourceforge.net/apps/mediawiki/sframe/index.php?title=Main_Page
It is worth reading these pages, as they explain the idea and structure of the package.
Setting up SFrame
The SFrame svn repository shows which tagged version of SFrame is the latest. Currently it is SFrame-03-04-10.
To get started, create a new working directory and check out SFrame into it:
mkdir Analysis
cd Analysis
svn co https://sframe.svn.sourceforge.net/svnroot/sframe/SFrame/tags/SFrame-03-04-10 SFrame
In order to be able to run SFrame you need a current version of ROOT. Either set up your own version or use the one that's available at DESY:
ini atlasfw
cd SFrame
source setup.sh
You need to repeat these steps every time you want to work with SFrame. To check that everything is working, try to compile:
make
SFrame ships some usage examples in the user directory. They also demonstrate the PROOF capabilities of SFrame, which we won't use in the following because PROOF proved unstable in connection with dCache.
Analysis with SFrame
SFrame is basically just an event loop, which you might know from your MakeClass exercises. SFrame, however, is much smarter and allows you to write a much cleaner and more modular analysis. As mentioned before, SFrame also has PROOF capabilities. You can find out more on the SFrame-PROOF page.
In general, your analysis can consist of several cycles. You might, for instance, run a basic selection that will not change for the rest of your analysis on all available datasets. Currently, we don't use this in our analyses, but the examples in SFrame/user will show you an example use case. Your analysis cycle consists of the following building blocks (more details):
virtual void BeginCycle() throw( SError ): Function called once at the beginning of executing the cycle, before the first InputData block is "opened". You can use it to perform an initial configuration of the cycle. For instance if the cycle needs to read some local file for some information (good data ranges for example), that can be done best here. The function is always executed in the sframe_main process, even when running in PROOF mode.
virtual void EndCycle() throw( SError ): Function called once at the end of the cycle execution. Any finalisation steps should be done here. (Closure of some helper files opened by the user code for instance.) This function is again called in the sframe_main process.
PROOF only: virtual void BeginMasterInputData( const SInputData& ) throw( SError ): Function called before processing each InputData block, on the master PROOF node. For more information about the PROOF functionality, have a look at the page SFrame-PROOF.
PROOF only: virtual void EndMasterInputData( const SInputData& ) throw( SError ): Function called after being finished processing one InputData block, on the master PROOF node. Notice that the PROOF master node receives the full statistics information from the InputData at this point. So this is a good place to print some summaries, do some final calculations on the created histograms (for instance fitting them), etc. For more information about the PROOF functionality, have a look at the page SFrame-PROOF.
virtual void BeginInputData( const SInputData& ) throw( SError ): Function called on the PROOF worker nodes once before processing each of the input data types. SFrame creates one output file per input data type. If you need to initialise output objects (histograms, etc.) before the event-by-event execution, you should do that here. Also the declaration of the output variables has to be done here.
virtual void EndInputData( const SInputData& ) throw( SError ): Function called last on the PROOF worker nodes before the processing of the input data type is finished. Notice that in this function the code can only access the statistics processed by the one worker node, so most post-processing of the output objects is better placed in the EndMasterInputData(...) function.
virtual void BeginInputFile( const SInputData& ) throw( SError ): For each new input file the user has to connect their input variables. (More on this later.) This has to be done in this function.
virtual void ExecuteEvent( const SInputData&, Double_t ) throw( SError ): This is the main analysis function that is called for each event. It receives the weight of the event, as it is calculated by the framework from the luminosities and generator cuts defined in the XML configuration.
If you paid attention while reading through the different functions, you might wonder about the luminosity calculation. We will come back to that later.
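To make the calling order of these functions concrete, here is a minimal stand-alone sketch for local (non-PROOF) running. It does not use SFrame: MockInputData and traceCycle are hypothetical names, and the loop structure is an assumption derived from the descriptions above, not SFrame's actual controller code. The two master-node functions are left out because the list marks them as PROOF only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for one InputData block: a type name plus the number of
// events in each of its input files. Not an SFrame class.
struct MockInputData {
    std::string type;
    std::vector<int> eventsPerFile;
};

// Record the order in which the cycle functions would be called.
// Assumption: one BeginInputData/EndInputData pair per input data type,
// one BeginInputFile per file, one ExecuteEvent per event.
std::vector<std::string> traceCycle(const std::vector<MockInputData>& inputs) {
    std::vector<std::string> calls;
    calls.push_back("BeginCycle");
    for (const MockInputData& id : inputs) {
        calls.push_back("BeginInputData(" + id.type + ")");
        for (int nEvents : id.eventsPerFile) {
            assert(nEvents >= 0);
            calls.push_back("BeginInputFile");
            for (int i = 0; i < nEvents; ++i) {
                calls.push_back("ExecuteEvent");
            }
        }
        calls.push_back("EndInputData(" + id.type + ")");
    }
    calls.push_back("EndCycle");
    return calls;
}
```

Calling traceCycle with one input data type that has two files shows BeginInputFile appearing twice between a single BeginInputData/EndInputData pair, which is why branches are connected per file while histograms are booked per input data type.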
Creating your own package
Now let's start doing some work. Go back to your analysis directory and type the following:
sframe_new_package.sh MyTestPackage
This will create a new directory MyTestPackage and put some basic files such as a Makefile into it. The package would already compile now, but it wouldn't do anything.
Creating your own cycle
To have something runnable, you need to create a cycle.
cd MyTestPackage
sframe_create_cycle.py -n MyTestAnalysis
See if it compiles:
make
Fiddling around with xml settings
Unfortunately, your cycle doesn't run out of the box. Copy SFrame/user/config/JobConfig.dtd into the config directory of your package, and add the DOCTYPE line after the first line of MyTestAnalysis_config.xml so that the xml parser knows how to treat the tags (don't worry about the details). The beginning of the file should then look like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE JobConfiguration PUBLIC "" "JobConfig.dtd" []>
<JobConfiguration JobName="MyTestAnalysisJob" OutputLevel="INFO">
We will now get our package running step by step, which will help you later with debugging when you're alone in your office. Let's try to run again:
cd config
sframe_main MyTestAnalysis_config.xml
SFrame now complains that the element In doesn't contain the attribute Lumi. This is because each input file in SFrame has an associated luminosity. How that is obtained will be shown later. For now, change the following line
<In FileName="YourInputFileComesHere"/>
to
<In FileName="/afs/ifh.de/group/atlas/scratch/topdata/16.0.3.3.3-Production-TauFix/user.clange.STop160333.mc10_7TeV.105200.T1_McAtNlo_Jimmy.merge.AOD.e598_s933_s946_r1831_r1700.20110204.110204151127/user.clange.003195.SingleTop._00001.root" Lumi="1.0"/>
Now we have an input file defined, with its luminosity set to 1.0 for the time being. Try to run again.
You get an error that the library isn't found. In your configuration you can include external libraries and have to make them known in the xml file. The same thing applies to the package itself. You need to make it known to the cycle. Adjust the xml file the following way:
<Library Name="libMyTestPackage"/>
As we don't have any user configuration set up so far, comment out the UserConfig section:
<!--
<UserConfig>
   <Item Name="NameOfUserProperty" Value="ValueOfUserProperty"/>
</UserConfig>
-->
Now we should be at a stage where a lot of things are already working. The next thing SFrame complains about is a missing input TTree name. SFrame needs to know about every input tree you're using in your cycle. In the InputData section add the following line:
<InputTree Name="RecoTree" />
Now start running again and here we go: We have our first running Cycle!
Have a look at the output:
( INFO ) SCycleController : Initializing
( INFO ) SCycleController : Deleting all analysis cycle algorithms from memory
( INFO ) SCycleController : read xml file: 'MyTestAnalysis_config.xml'
( INFO ) SCycleController : Created cycle 'MyTestAnalysis'
( INFO ) MyTestAnalysis : Initializing from configuration
( INFO ) MyTestAnalysis : Reading SInputData: Data1 - Reco
( INFO ) SCycleConfig : ===========================================================
( INFO ) SCycleConfig : Cycle configuration
( INFO ) SCycleConfig : - Running mode: LOCAL
( INFO ) SCycleConfig : - Target luminosity: 1
( INFO ) SCycleConfig : - Output directory: ./
( INFO ) SCycleConfig : - Post-fix:
( INFO ) SInputData : ---------------------------------------------------------
( INFO ) SInputData : Type : Data1
( INFO ) SInputData : Version : Reco
( INFO ) SInputData : Total luminosity : 1pb-1
( INFO ) SInputData : NEventsMax : -1
( INFO ) SInputData : NEventsSkip : 0
( INFO ) SInputData : Cacheable : No
( INFO ) SInputData : Skip validation : No
( INFO ) SInputData : Input File : '/Users/clange/Analyse/user.clange.STop160333.mc10_7TeV.105200.T1_McAtNlo_Jimmy.merge.AOD.e598_s933_s946_r1831_r1700.20110204.110204151127/user.clange.003195.SingleTop._00001.root' (file) | '1' (lumi)
( INFO ) SInputData : Tree : 'RecoTree' (name) | 'Flat input tree' (type)
( INFO ) SInputData : ---------------------------------------------------------
( INFO ) SCycleConfig : ===========================================================
( INFO ) SCycleController : Job 'MyTestAnalysisJob' configured
( INFO ) SCycleController : Time needed for initialisation: 0.019 s
( INFO ) SCycleController : Entering ExecuteAllCycles()
( INFO ) SInputData : Input type "Data1" version "Reco" : 4995 events
( INFO ) SCycleController : Executing Cycle #0 ('MyTestAnalysis') locally
( INFO ) SCycleController : Processing input data type: Data1 version: Reco
( INFO ) MyTestAnalysis : Initialised InputData "Data1" (Version:Reco) on worker node
( INFO ) MyTestAnalysis : Processing entry: 999 (999 / 4995 events processed so far)
( INFO ) MyTestAnalysis : Processing entry: 1999 (1999 / 4995 events processed so far)
( INFO ) MyTestAnalysis : Processing entry: 2999 (2999 / 4995 events processed so far)
( INFO ) MyTestAnalysis : Processing entry: 3999 (3999 / 4995 events processed so far)
( INFO ) MyTestAnalysis : Terminated InputData "Data1" (Version:Reco) on worker node
( INFO ) SCycleController : Writing output of "MyTestAnalysis" to: ./MyTestAnalysis.Data1.Reco.root
( INFO ) SCycleController : Overall cycle statistics:
( INFO ) SCycleController : 4995 Events - Real time 0.37 s - 13643 Hz | CPU time 0.28 s - 17839 Hz
You can see that SFrame runs over all 4995 events in the ntuple and spits out a file called MyTestAnalysis.Data1.Reco.root. As we didn't do anything, it is empty.
Here's a small exercise for you:
- Try to skip the first 500 events and limit the analysis to 1000 events.
- Change the output file name to MyTestAnalysis.FirstTry.root
- Hint: look at the SFrame output to find out what needs to be set.
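The log above prints a target luminosity for the cycle and a total luminosity for the input data. A common way to turn these two numbers into an event weight is to scale the sample to the target luminosity. The sketch below shows that arithmetic only: the formula is an assumption for illustration (SFrame's own calculation also takes the generator cuts into account and is not reproduced here), and eventWeight is a hypothetical helper.

```cpp
#include <cassert>

// Assumed normalisation: scale a sample with luminosity `sampleLumi`
// (in pb^-1) to the target luminosity `targetLumi` (in pb^-1).
// This is an illustration, not SFrame's implementation.
double eventWeight(double targetLumi, double sampleLumi) {
    assert(sampleLumi > 0.0);  // a sample without luminosity cannot be scaled
    return targetLumi / sampleLumi;
}
```

Under this assumption, the configuration used so far (target luminosity 1, input luminosity 1.0) gives every event a weight of 1, which is why the placeholder value is harmless for now.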
Code manipulation
Now that we have a running cycle, let's add some salt and pepper to it. First of all we need to connect a few branches. This is what one would naturally do for each input file, i.e. in BeginInputFile. Open src/MyTestAnalysis.cxx. Add the following lines:
//
// Connect the input variables:
//
ConnectVariable( m_recoTreeName.c_str(), (m_electronPrefix + "_N").c_str(), m_Electron_N );
ConnectVariable( m_recoTreeName.c_str(), (m_electronPrefix + "_pt").c_str(), m_Electron_pt );
ConnectVariable( m_recoTreeName.c_str(), (m_electronPrefix + "_eta").c_str(), m_Electron_eta );
ConnectVariable( m_recoTreeName.c_str(), (m_electronPrefix + "_phi").c_str(), m_Electron_phi );
ConnectVariable( m_recoTreeName.c_str(), (m_electronPrefix + "_e").c_str(), m_Electron_e );
This command works much like ConnectBranch. All those member variables m_* need to be declared in the header file of MyTestAnalysis. In the case of the ntuple we're using now, they are of type std::vector<float>*, except for m_Electron_N, which is of type int. You always have to make sure you're using the right data type when connecting branches. You can check the type of each variable by right-clicking on the desired branch in the ROOT TBrowser and choosing Inspect.
While you're editing the header file, also add two strings named m_recoTreeName and m_electronPrefix. Your header file should have the following content now:
private:
   //
   // Put all your private variables here
   //
   std::string m_recoTreeName;
   std::string m_electronPrefix;
   // branch names
   int m_Electron_N;
   std::vector<float>* m_Electron_pt;
   std::vector<float>* m_Electron_eta;
   std::vector<float>* m_Electron_phi;
   std::vector<float>* m_Electron_e;
In a minute we're going to do something smart about these strings; for now we just set them by hand. As they are properties that need to be set for the whole cycle, add the following lines in the constructor (MyTestAnalysis::MyTestAnalysis() : SCycleBase()):
m_recoTreeName = "RecoTree";
m_electronPrefix = "Electron";
As we added a rather complicated data structure of std::vector<float> we would need to make it known to ROOT. This has, however, already been taken care of by one of the built-in SFrame classes. We just need to make it known to the Cycle. In the xml file add the following line:
<Library Name="libGenVector" />
If you wanted to use more complicated data structures such as std::vector< std::vector< float > > you would need to define them in the src/*LinkDef.h file the following way:
#pragma link C++ class std::vector< std::vector< float > >+;
If you now compile and run again, the output won't change. To see that something's happening, you need to change the OutputLevel to DEBUG (in the xml). You will then see that the branches are connected.
Output Level intermezzo
While we're changing the output level, let's try the different available output levels. In BeginCycle add the following lines:
//
// Test what the various printed lines look like:
//
m_logger << VERBOSE << "This is a VERBOSE line" << SLogger::endmsg;
m_logger << DEBUG << "This is a DEBUG line" << SLogger::endmsg;
m_logger << INFO << "This is an INFO line" << SLogger::endmsg;
m_logger << WARNING << "This is a WARNING line" << SLogger::endmsg;
m_logger << ERROR << "This is an ERROR line" << SLogger::endmsg;
m_logger << FATAL << "This is a FATAL line" << SLogger::endmsg;
m_logger << ALWAYS << "This is an ALWAYS line" << SLogger::endmsg;
As you can see, the SFrame logger allows you to configure your output in a very comfortable way. You should try to use the logger instead of cout: because the output level is set in the xml, you can switch debug output on and off without recompiling (and it helps in many other cases, too).
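The idea behind a level-based logger can be sketched in a few lines of stand-alone C++. MiniLogger and MsgLevel are hypothetical stand-ins, not SLogger; the sketch only shows the filtering principle, namely that each message carries a level and is dropped if that level lies below a threshold chosen at run time.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Message levels in increasing order of severity (names mirror the
// levels used above; this enum is not SFrame's).
enum class MsgLevel { Verbose, Debug, Info, Warning, Error, Fatal, Always };

// Minimal logger that keeps only messages at or above a run-time threshold.
class MiniLogger {
public:
    explicit MiniLogger(MsgLevel threshold) : m_threshold(threshold) {}

    void log(MsgLevel level, const std::string& text) {
        if (level >= m_threshold) {
            m_out << text << '\n';
        }
    }

    std::string output() const { return m_out.str(); }

private:
    MsgLevel m_threshold;
    std::ostringstream m_out;
};

// Usage example: the same code prints more or less depending on the threshold.
void exampleLogger() {
    MiniLogger quiet(MsgLevel::Info);
    quiet.log(MsgLevel::Debug, "This is a DEBUG line");
    quiet.log(MsgLevel::Info, "This is an INFO line");
    assert(quiet.output() == "This is an INFO line\n");

    MiniLogger chatty(MsgLevel::Debug);
    chatty.log(MsgLevel::Debug, "This is a DEBUG line");
    assert(chatty.output() == "This is a DEBUG line\n");
}
```

Because the threshold is plain data, it can come from a configuration file, which is what makes the OutputLevel setting in the xml possible without touching the code.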
Histogram booking and filling
Back to our analysis cycle. We now want to make the first step and fill a histogram. SFrame allows you to create and fill a histogram in one line without the need of declaring it. To fill something useful, let's loop over all electrons for each event and fill the pt into a histogram (in ExecuteEvent):
for( Int_t i = 0; i < m_Electron_N; ++i ) {
   // Fill the example histogram:
   Book( TH1F( "El_pt_hist", "Electron p_{T} [MeV]", 100, 0.0, 150000.0 ) )->Fill( (*m_Electron_pt)[i] );
}
If you fill a histogram in two places, it's better to book it in BeginInputData:
Book( TH1F( "El_eta_hist", "Electron #eta", 20, -5.0, 5.0 ) );
and then fill it in ExecuteEvent (in the loop):
Hist( "El_eta_hist" )->Fill( (*m_Electron_eta)[i] );
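The two styles above (booking and filling in one line versus booking once and looking up by name) can be illustrated with a small stand-alone sketch. MiniHist and HistStore are hypothetical stand-ins for TH1F and for the cycle's histogram bookkeeping; that a repeated Book call with the same name returns the already existing histogram is an assumption suggested by the loop above, not a statement about SFrame's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Minimal fixed-binning histogram; a stand-in for TH1F, not a ROOT class.
struct MiniHist {
    int nBins;
    double lo;
    double hi;
    std::vector<int> counts;

    MiniHist(int n, double l, double h) : nBins(n), lo(l), hi(h), counts(n, 0) {}

    void Fill(double x) {
        if (x < lo || x >= hi) return;  // ignore under- and overflow
        int bin = static_cast<int>((x - lo) / (hi - lo) * nBins);
        if (bin >= nBins) bin = nBins - 1;  // guard against rounding at the edge
        ++counts[bin];
    }
};

class HistStore {
public:
    // Return the histogram registered under `name`, creating it on first use.
    MiniHist* Book(const std::string& name, int nBins, double lo, double hi) {
        auto it = m_hists.find(name);
        if (it == m_hists.end()) {
            it = m_hists.emplace(name, MiniHist(nBins, lo, hi)).first;
        }
        return &it->second;
    }

    // Look up a histogram booked earlier; nullptr if the name is unknown.
    MiniHist* Hist(const std::string& name) {
        auto it = m_hists.find(name);
        return it == m_hists.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return m_hists.size(); }

private:
    std::map<std::string, MiniHist> m_hists;
};

// Usage example: three fills through Book() still give a single histogram.
void exampleBooking() {
    HistStore store;
    const double pts[] = {25000.0, 40000.0, 40100.0};  // made-up pt values in MeV
    for (double pt : pts) {
        store.Book("El_pt_hist", 100, 0.0, 150000.0)->Fill(pt);
    }
    assert(store.size() == 1);
    assert(store.Hist("El_pt_hist")->counts[16] == 1);
    assert(store.Hist("El_pt_hist")->counts[26] == 2);
}
```

Both styles pay for a string lookup on every fill; booking once mainly keeps the binning definition in a single place.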
Writing output trees
For your secondary cycle or for multivariate analyses you might need an output tree with output branches. In your header file you need to declare the variables:
//
// The output variables
//
std::vector< float > m_o_El_pt;
In BeginInputData you define the link between variable and branch output:
//
// Declare the output variables:
//
DeclareVariable( m_o_El_pt, "El_pt" );
In ExecuteEvent you should first clear the vector, then fill it:
m_o_El_pt.clear();
...
m_o_El_pt.push_back( (*m_Electron_pt)[i] );
The code will compile, but it won't run, because you need to define an output tree name in the xml:
<OutputTree Name="OutTree" />
Now run and have a look at the output ROOT file.
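Why the vector has to be cleared at the start of every event can be seen in a stand-alone sketch. MiniTree and writtenSizes are hypothetical helpers, not SFrame or ROOT code; the sketch assumes only that the same buffer object is written out once per event.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for an output tree: keeps a copy of the branch buffer per event.
struct MiniTree {
    std::vector<std::vector<float>> entries;
    void Fill(const std::vector<float>& buffer) { entries.push_back(buffer); }
};

// Process a list of events (each a list of electron pt values) and return
// how many values ended up in the output entry of each event.
std::vector<std::size_t> writtenSizes(const std::vector<std::vector<float>>& events,
                                      bool clearEachEvent) {
    MiniTree tree;
    std::vector<float> o_El_pt;  // plays the role of m_o_El_pt
    for (const std::vector<float>& electrons : events) {
        if (clearEachEvent) o_El_pt.clear();
        for (float pt : electrons) o_El_pt.push_back(pt);
        tree.Fill(o_El_pt);  // stand-in for writing this event to the output tree
    }
    std::vector<std::size_t> sizes;
    for (const std::vector<float>& entry : tree.entries) sizes.push_back(entry.size());
    return sizes;
}
```

For two events with two and one electrons, the entries hold 2 and 1 values when the buffer is cleared, but 2 and 3 values when it is not, because the second event then still carries the electrons of the first.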
Making your code xml-configurable
One of the nice features of SFrame is its configurability via xml, which makes your code very flexible. In one of the previous exercises we set m_recoTreeName and m_electronPrefix by hand in the constructor. Now we will make the electron prefix configurable. Replace its string assignment by:
//
// Declare the properties of the cycle:
//
DeclareProperty( "ElectronPrefix", m_electronPrefix = "Electron" );
The default value is not needed, but can be very useful. We can now use the UserConfig section in the xml and for example change the ElectronPrefix to Jet:
<UserConfig>
   <Item Name="ElectronPrefix" Value="Jet"/>
</UserConfig>
Compile, run and you will see very different histograms/branches in the output file.
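The mechanism behind such configurable properties can be sketched without SFrame: the declaration binds a name to a member variable, and the configuration later overwrites the variable through that binding. PropertyRegistry and its methods are hypothetical names for illustration and do not describe how SFrame implements DeclareProperty.

```cpp
#include <cassert>
#include <map>
#include <string>

class PropertyRegistry {
public:
    // Bind `name` to a variable; the variable's current value acts as default.
    void Declare(const std::string& name, std::string& target) {
        m_props[name] = &target;
    }

    // Apply one name/value pair from the configuration.
    // Returns false if no property with that name was declared.
    bool Set(const std::string& name, const std::string& value) {
        auto it = m_props.find(name);
        if (it == m_props.end()) return false;
        *it->second = value;
        return true;
    }

private:
    std::map<std::string, std::string*> m_props;
};

// Usage example: the default survives until the configuration overrides it.
void exampleProperty() {
    std::string electronPrefix;
    PropertyRegistry registry;
    // The assignment yields a reference to the string, so the default can be
    // set and the variable bound in one expression.
    registry.Declare("ElectronPrefix", electronPrefix = "Electron");
    assert(electronPrefix + "_pt" == "Electron_pt");

    registry.Set("ElectronPrefix", "Jet");
    assert(electronPrefix + "_pt" == "Jet_pt");
}
```

Because the branch names are built from the prefix when the variables are connected, changing one configuration value is enough to point the same code at a different set of branches.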
Now you know the most important basics of SFrame and we can continue to more specific stuff. SFrame has, of course, a lot more functionality, but as we won't need that for this tutorial, you can have a look yourself some other time.
documentation
Besides the general SFrame documentation there are a couple of pages concerning documentation in this wiki on the AnalysisFramework. After having gone through this tutorial it is also YOUR responsibility to keep documentation up-to-date.
The most important page is the list of AvailablePackages. This page describes how to check-out every package that is available and possibly additional commands that are needed.
In addition, we have set up an automatic code documentation page using doxygen with weekly updates. This is still under construction but mostly working and will be helpful when you write your analysis.
Take your time and browse through the documentation (atlas/insider).
functionality of central/core packages
The analysis framework consists of a few central packages that save you a lot of coding and make collaboration with other group members easy and comfortable.
D3PDVariables
The D3PDVariables package provides a wrapper to the ntuple variables and at the same time provides the analyser with Particle classes and some comfort functionality. Main purpose:
- automatically connect branches of standard physics objects with desired level of detail
- additional particle class for easier looping that provides some additional useful functions
repository: https://svnweb.cern.ch/trac/desyatfw/browser/CommonAnalysis/Common/D3PDVariables
Go into your analysis directory and get the package. We will put it into the Common directory, because it's used by all analyses:
mkdir Common
cd Common
svn co svn+ssh://svn.cern.ch/reps/desyatfw/CommonAnalysis/Common/D3PDVariables/trunk D3PDVariables
The package contains a python script to automatically generate the D3PDVariables from a few tab-separated text files. To create the variables, run the following from the package directory:
python scripts/CodeIt.py
The configuration files and code skeletons are located in scripts/Meta. It currently works for Electron, Muon, Jet, TrackParticle and Vertex. The structure of each text file is as follows:
detaillevel \t type \t variable name
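Reading one line of this three-column format could look like the following stand-alone sketch. It is not part of D3PDVariables (whose generator is the python script above); VariableSpec, trim and parseSpecLine are hypothetical names, and the sample lines used with them are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

struct VariableSpec {
    int detailLevel;
    std::string type;
    std::string name;
};

// Strip leading and trailing whitespace.
std::string trim(const std::string& s) {
    const std::string ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parse "detaillevel <TAB> type <TAB> variable name".
// Returns false (and leaves `spec` untouched) for malformed lines.
bool parseSpecLine(const std::string& line, VariableSpec& spec) {
    std::istringstream in(line);
    std::string level, type, name;
    if (!std::getline(in, level, '\t')) return false;
    if (!std::getline(in, type, '\t')) return false;
    if (!std::getline(in, name)) return false;

    level = trim(level);
    type = trim(type);
    name = trim(name);
    if (type.empty() || name.empty()) return false;

    int parsedLevel = 0;
    try {
        std::size_t pos = 0;
        parsedLevel = std::stoi(level, &pos);
        if (pos != level.size()) return false;  // trailing junk after the number
    } catch (const std::exception&) {
        return false;  // not a number at all
    }

    spec.detailLevel = parsedLevel;
    spec.type = type;
    spec.name = name;
    return true;
}

// Usage example with a made-up line.
void exampleParse() {
    VariableSpec spec = {-1, "", ""};
    assert(parseSpecLine("0\tfloat\tpt", spec));
    assert(spec.detailLevel == 0 && spec.type == "float" && spec.name == "pt");
}
```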
Let's now integrate this package in our test analysis. As we already have some electron properties implemented, let's give it a try with muons. First of all, compile the D3PDVariables package. This will copy the package shared libraries to the SFrame directory.
In MyTestAnalysis header file add the following includes:
// External include(s):
#include "../../Common/D3PDVariables/include/MuonD3PDObject.h"
#include "../../Common/D3PDVariables/include/Muon.h"
and make a forward declaration:
namespace DESY {
   class Muon;
}
In the private section add the following:
//
// Input variable objects:
//
D3PD::MuonD3PDObject m_muon; ///< muon container
Now we need to initialise the object correctly in the cxx file. Extend the constructor:
MyTestAnalysis::MyTestAnalysis()
   : SCycleBase(), m_muon( this ) {
In BeginInputFile we can now connect the variables. Instead of several lines as before for electrons you just need one line:
//
// Connect all the D3PDObjects
//
m_muon.ConnectVariables( m_recoTreeName.c_str(), 0, "Muon_" );
The number 0 denotes the detail level. The fewer variables you connect, the faster your analysis runs. The detail levels are set in D3PDVariables.
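One plausible reading of the detail level is a threshold: every variable is tagged with a level, and only variables whose level does not exceed the requested one get connected. The stand-alone sketch below illustrates that reading. It is an assumption for illustration, not the D3PDVariables implementation, and VarSpec, selectForLevel and the variable names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct VarSpec {
    int detailLevel;
    std::string name;
};

// Return the names of all variables that would be connected at the
// requested detail level (assumed rule: level <= requested).
std::vector<std::string> selectForLevel(const std::vector<VarSpec>& all, int requested) {
    std::vector<std::string> connected;
    for (const VarSpec& v : all) {
        if (v.detailLevel <= requested) connected.push_back(v.name);
    }
    return connected;
}

// Usage example with made-up variables.
void exampleDetailLevel() {
    const std::vector<VarSpec> muonVars = {
        {0, "pt"}, {0, "eta"}, {1, "charge"}, {4, "myNewVariable"}};
    assert(selectForLevel(muonVars, 0).size() == 2);  // only the basics
    assert(selectForLevel(muonVars, 4).size() == 4);  // everything
}
```

Under this reading, a variable tagged with a higher level than the one passed to ConnectVariables stays unconnected, so using it in the event loop fails at run time.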
We can now loop over all muons in the D3PD (MuonD3PDObject), create a Muon object, get the TLorentzVector and fill histograms:
for( Int_t i = 0; i < m_muon.N; ++i ) {
   // set muon object
   DESY::Muon mymu( &m_muon, i );
   Book( TH1F( "Mu_pt_hist", "Muon p_{T} [MeV]", 100, 0.0, 150000.0 ) )->Fill( mymu.pt() );
   TLorentzVector* mu_tlv = mymu.getTLV();
   Book( TH1F( "Mu_rap_hist", "Muon rapidity", 20, -5.0, 5.0 ) )->Fill( mu_tlv->Rapidity() );
}
Now we need to make the D3PDVariables package known to our cycle:
<Library Name="libD3PDVariables" />
Compile and run. If your OutputLevel is still set to DEBUG you can see which variables are connected.
Exercise:
- Open the ntuple, pick one of the muon variables that is not yet defined in D3PDVariables and add it at detail level 4 in the text file. Mind: you can use the cxx and h files for testing, but except for D3PDVariables/src/D3PDObjectsNames.cxx you should only edit the txt files!
- Run the python script.
- Open D3PDVariables/src/D3PDObjectsNames.cxx. Here you see the actual wrapping of object names to ntuple variable names. If the variable you picked is not yet there (in the SingleTopDPDMaker section), add it.
- Compile D3PDVariables.
- Add a histogram into which you fill the new variable.
- Compile MyTestPackage and run MyTestAnalysis. It will crash, because you didn't adjust the detail level. Take note of the error message: this crash is common and you should be able to recognise it.
- Now adjust the detail level.
Note: In D3PDObjectsNames.cxx you will have seen variable wrapping for other ntuple names. This is the place where the wrapping for TopD3PDs would go. That will basically be the only change you need. Even if you started your analysis on SingleTopD3PDs, you can easily switch with just this little change!
SelectionTools
Package allowing for object selection and overlap removal
- all cuts are tunable via xml
- automatic creation of validation plots
repository: https://svnweb.cern.ch/trac/desyatfw/browser/CommonAnalysis/Common/SelectionTools
This package should also go into Common.
svn co svn+ssh://svn.cern.ch/reps/desyatfw/CommonAnalysis/Common/SelectionTools/trunk SelectionTools
cd SelectionTools
make
Let's now add a muon selector to the analysis. In the header file add the include:
#include "../../Common/SelectionTools/include/MuonSelectorTool.h"
and in the private section add an instance of MuonSelectorTool:
//
// The selector tools
//
MuonSelectorTool m_muonSelector; ///< selector for muon candidates
Extend the constructor as you did before with m_muon:
MyTestAnalysis::MyTestAnalysis()
   : SCycleBase(), m_muon( this ), m_muonSelector( this, "MuonSelector" ) {
The string "MuonSelector" is important in case you want to change one of the cuts in your xml. You can of course have several instances of MuonSelector (with different names) and implement different selections.
The selectors need to be initialised in BeginInputData, which also needs to be extended with "id":
void MyTestAnalysis::BeginInputData( const SInputData& id ) throw( SError ) {
   //
   // Initialize the tool(s):
   //
   m_muonSelector.BeginInputData( id );
This will also print out the selection. Also remember to add the library to the xml. Let's now have the MuonSelector do something. We can just pass the Muon object in our loop to it:
if( m_muonSelector.IsPassed( mymu ) ) {
   mymu.flagAsGood();
   Book( TH1F( "Mu_pt_sel_hist", "Selected Muon p_{T} [MeV]", 100, 0.0, 150000.0 ) )->Fill( mymu.pt() );
}
If you compile and run now, your code will crash. The reason for that is a little subtle. You will encounter errors like these quite often. One reason might be that you didn't compile all packages consistently. To solve this do
make distclean && make
in each package starting from the most basic one. This is, however, not the problem here. SelectionTools use SFrame's own slim histograms SH1, but our cycle doesn't. To make them known to our cycle, we need to add the following line to the xml:
<Library Name="libSFramePlugIns" />
Now everything should work. The selection tools will also create some validation hists. Have a look at them!
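What a selector tool does at its core can be sketched independently of SelectionTools: it holds tunable cut values, answers pass or fail per object, and keeps some bookkeeping for validation. PtWindowSelector below is a hypothetical stand-in, not MuonSelectorTool; the cut values are made up and given in MeV, the unit used in the histogram titles above.

```cpp
#include <cassert>
#include <limits>

class PtWindowSelector {
public:
    // minPt and maxPt in MeV; by default there is no upper cut.
    explicit PtWindowSelector(double minPt,
                              double maxPt = std::numeric_limits<double>::infinity())
        : m_minPt(minPt), m_maxPt(maxPt), m_seen(0), m_passed(0) {}

    // Apply the cuts to one candidate and update the counters.
    bool IsPassed(double pt) {
        ++m_seen;
        const bool ok = (pt >= m_minPt) && (pt <= m_maxPt);
        if (ok) ++m_passed;
        return ok;
    }

    int seen() const { return m_seen; }
    int passed() const { return m_passed; }

private:
    double m_minPt;
    double m_maxPt;
    int m_seen;    // number of candidates tested
    int m_passed;  // number of candidates accepted
};

// Usage example: a 30 GeV to 100 GeV window applied to three candidates.
void exampleSelector() {
    PtWindowSelector selector(30000.0, 100000.0);
    assert(!selector.IsPassed(25000.0));   // below the lower cut
    assert(selector.IsPassed(45000.0));    // inside the window
    assert(!selector.IsPassed(120000.0));  // above the upper cut
    assert(selector.seen() == 3 && selector.passed() == 1);
}
```

Keeping the cut values as members rather than literals in the event loop is what allows them to be changed from the configuration instead of the code.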
Exercise:
- Change the pt cut to 30 GeV using only the xml file.
- Add a new cut to the MuonSelectorTool called maxpt that limits the selected muons' pt to 100 GeV. Try also to add the validation histograms.
DesyUtilities
The DesyUtilities package is the Swiss Army knife of the analysis. Currently, it contains most tools needed for systematic variations as well as a lot of utility functions.
We're just going to focus on one of them: SCycleBaseDesy
SCycleBaseDesy extends SCycleBase with some extra functionality. It allows you to define your histograms in a text file. Instead of using string look-ups, it uses ints as identifiers, which is a lot faster. To make use of this, you have to make changes in several places in your code. Have a look at the GoDesy package (https://svnweb.cern.ch/trac/desyatfw/browser/CommonAnalysis/Top/GoDesy/trunk), which uses the new histogramming style. Basic changes:
- extend the Makefile
- replace SCycleBase by SCycleBaseDesy
- create a common.par file in config
- make sure you call BookHistograms()
- also make the library known in the xml
It's left up to you whether you would like to take advantage of this feature. If you have a rather large analysis, it should be a lot faster in most cases.
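The difference between string look-ups and integer identifiers can be sketched as follows. IndexedHists is a hypothetical illustration, not SCycleBaseDesy: names are resolved to an integer once at booking time, and the event loop then fills by index, which is a plain array access instead of a string comparison per fill. Each "histogram" is reduced to an entry counter to keep the sketch short.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

class IndexedHists {
public:
    // Register a name once and return its integer identifier.
    // Booking the same name again returns the same identifier.
    int Book(const std::string& name) {
        auto it = m_index.find(name);
        if (it != m_index.end()) return it->second;
        const int id = static_cast<int>(m_entries.size());
        m_index[name] = id;
        m_entries.push_back(0);
        return id;
    }

    // Fill by identifier: no string handling in the event loop.
    void Fill(int id) { ++m_entries.at(id); }

    int Entries(int id) const { return m_entries.at(id); }

private:
    std::map<std::string, int> m_index;  // used at booking time only
    std::vector<int> m_entries;          // one counter per histogram
};

// Usage example: resolve the names once, then fill by index in the loop.
void exampleIndexed() {
    IndexedHists hists;
    const int ptId = hists.Book("Mu_pt_hist");
    const int rapId = hists.Book("Mu_rap_hist");
    for (int event = 0; event < 1000; ++event) {
        hists.Fill(ptId);
        if (event % 2 == 0) hists.Fill(rapId);
    }
    assert(hists.Entries(ptId) == 1000);
    assert(hists.Entries(rapId) == 500);
}
```

The gain grows with the number of histograms and fills per event, which matches the remark that large analyses benefit most.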
some more hints
inclusion of external packages
We've seen that inclusion of external packages can be difficult. Make sure you adjust the LinkDef.h file if needed. Also try to avoid mixing packages provided by other people with packages such as DesyUtilities. Try to check-out those packages to an extra directory so that one can easily update them.
writing your own package
When writing your own package, as we've done above, try to have as few dependencies as possible, because you can easily lose the overview. Sometimes it's worth starting a new package from scratch instead of copying an old one.
coding rules
As there are several new people joining the group now and not all people talk to each other on a frequent basis, you should make sure that you only check-in running code and if you need to change the interface (which you should avoid), make this known to everyone! I will add you to the atlas-analysis-fw@desy.de list after this meeting.
There is an SFrame users mailing list: atlas-sframe-users@cern.ch. Subscribe to it via hypernews. Attila provides very good support, and in case of problems it often helps to search the archive.
Please make sure you comment your code properly. As we use doxygen, try to follow the standards.