Skimming a large hadronic dataset (or running a Monte Carlo production) generates a huge number of files of hundreds of MB each. For example, skimming the xbhdod dataset produces ~42000 "small" files.
The concatenation tool provides an easy way to concatenate these files and store them on tape using SAM (Sequential Access via Metadata), a new data handling system for CDF.
The tool consists of a set of Python scripts, each developed to carry out a specific task:
samConcDeclareList
samConcListGenerator
samConcDatasetSplitter
samConcCafSubmitter
samConcStore
samConcReadSegmentConfig
How does it work:
The tool is meant to be run iteratively until the list of files to be concatenated is empty. The figure shows the architecture of the tool in detail.
The code is publicly available in CVS under PadovaSam/skim_tools.
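The iterative scheme can be illustrated with a minimal Python sketch. It is not part of the tool: the names concatenate_until_empty, list_pending and run_iteration are hypothetical, and the two callables stand in for "regenerate the list of files still to be concatenated" and "split, submit to CAF and wait for the job to finish".

```python
from typing import Callable, List


def concatenate_until_empty(
    list_pending: Callable[[], List[str]],
    run_iteration: Callable[[List[str]], None],
    max_iterations: int = 20,
) -> int:
    """Run iterations until no file is left to concatenate.

    list_pending  -- hypothetical callable returning the files still pending
    run_iteration -- hypothetical callable performing one full iteration
    Returns the number of iterations that were run.
    """
    iterations = 0
    pending = list_pending()
    while pending:
        if iterations == max_iterations:
            # Guard against looping forever if some files never get processed.
            raise RuntimeError(
                "files still pending after %d iterations" % max_iterations
            )
        run_iteration(pending)
        iterations += 1
        # Regenerate the list only after the iteration has completed.
        pending = list_pending()
    return iterations
```

The max_iterations guard is an addition of the sketch; it simply stops the loop if some files are never picked up.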
fcdflnx2.fnal.gov:/cdf/home/delli > setup cdfsoft2 5.3.4
fcdflnx2.fnal.gov:/cdf/home/delli > cvs checkout PadovaSam/skim_tools
cvs checkout: Updating PadovaSam/skim_tools
U PadovaSam/skim_tools/cleanDurableLocation
U PadovaSam/skim_tools/conf.py
U PadovaSam/skim_tools/samConcCafSubmitter
U PadovaSam/skim_tools/samConcDatasetSplitter
U PadovaSam/skim_tools/samConcDeclareList
U PadovaSam/skim_tools/samConcListGenerator
U PadovaSam/skim_tools/samConcReadSegmentConfig
U PadovaSam/skim_tools/samConcSequentialSubmitter
U PadovaSam/skim_tools/samConcStore
U PadovaSam/skim_tools/samDatasetConcate
U PadovaSam/skim_tools/samStoreCdfFile_v6
cvs checkout: Updating PadovaSam/skim_tools/scripts
U PadovaSam/skim_tools/scripts/mergeSubmit.py
U PadovaSam/skim_tools/scripts/testIfChildIsOnTape.py
U PadovaSam/skim_tools/scripts/testIfSkimmedFilesAreOK.py
cvs checkout: Updating PadovaSam/skim_tools/template
U PadovaSam/skim_tools/template/samConcCafCommand_template.txt
U PadovaSam/skim_tools/template/samConcCafSubmitter_template.csh
U PadovaSam/skim_tools/template/samConcStore_template.tcl
The commands below make up a single iteration. The user must repeat this sequence of commands until the list of files to be concatenated is empty. Each new iteration must be started only after the CAF job of the previous one has finished.
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcListGenerator --inputdataset=shdl02 --outputdataset=chdl02 --outputfile=concatenationList_shdl02.txt
Opening new file...Done
Getting files in output dataset...
[]
Done
Getting files in input dataset...
Done
Looping on files list...
Checking s2025b96.014bhdl0 children...
No children for file s2025b96.014bhdl0
Writing file s2025b96.014bhdl0 input files list...Done
Checking s2025c1e.05a0hdl0 children...
No children for file s2025c1e.05a0hdl0
Writing file s2025c1e.05a0hdl0 input files list...Done
Checking s2025b54.00d7hdl0 children...
No children for file s2025b54.00d7hdl0
Writing file s2025b54.00d7hdl0 input files list...Done
...
Checking s2027cf8.03c9hdl0 children...
No children for file s2027cf8.03c9hdl0
Writing file s2027cf8.03c9hdl0 input files list...Done
Done
Closing file...Done
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcDatasetSplitter --inputfilelist=concatenationList_shdl02.txt --outputsplittedfilelist=concatenationSplittedList_shdl02.conf --outputdataset=chdl02
Getting information from file list and metadata
Processing 100/25577
Processing 200/25577
...
Processing 25400/25577
Processing 25500/25577
Done
Ordering files by runnumber
Plan file table:
[('138815', 's2021e3f.0047hdl0', '18717'),... ,('155795', 's2026093.01efhdl0', '6362'), ('155795', 's2026093.02b9hdl0', '4490')]
Done
Splitting files into segments
Processing 100/1000
Processing 200/1000
...
Processing 25400/25577
Processing 25500/25577
Splitted file table:
{ ... 1: [('s2021e3f.0047hdl0', 18717), ('s2021e44.0071hdl0', 20063), ('s2021e4f.0082hdl0', 21477), ('s2021e56.00d8hdl0', 19350), ('s2021e57.0066hdl0', 18694), ('s2021e6b.0028hdl0', 19487), ('s2021e6c.00dchdl0', 20275), ('s2021e6f.0033hdl0', 21336), ('s2021e70.009dhdl0', 1797), ('s2021e70.00cahdl0', 20122), ('s2021fc6.002ehdl0', 19178), ('s2021fc6.0221hdl0', 18775), ('s2021ff5.0006hdl0', 20842), ('s2021ff5.0162hdl0', 18596), ('s202200d.015fhdl0', 22114), ('s202200d.0255hdl0', 22437), ('s202200e.0085hdl0', 2198), ('s202200e.00b3hdl0', 20631), ('s202201f.0073hdl0', 18538), ('s2022022.006fhdl0', 21722), ('s2022024.009bhdl0', 20370), ('s2022048.0065hdl0', 20076), ('s2022049.004chdl0', 22482), ('s202204a.0125hdl0', 30670), ('s202220c.005ehdl0', 3420), ('s2022273.0064hdl0', 36856), ('s202235f.00c4hdl0', 37568), ('s2022375.0033hdl0', 35809), ('s20223b9.0007hdl0', 43603), ('s20223d1.007dhdl0', 30502), ('s20225b1.0001hdl0', 38827), ('s2022633.00c4hdl0', 10266), ('s20226ef.0096hdl0', 36746), ('s2022729.0033hdl0', 37949), ('s20228e3.0021hdl0', 2749), ('s20228e6.00d8hdl0', 40729), ('s2022904.00b6hdl0', 40358), ('s202291d.0148hdl0', 37407), ('s202291d.02a8hdl0', 39643), ('s2022932.0002hdl0', 38474), ('s2022935.0001hdl0', 41046)]}
Done
Writing splitted file list...
Done
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcCafSubmitter --location=fcdfdata014.fnal.gov:/cdf/scratch/cdfdata/sam/temporary-location/ --splittedfilelist=./concatenationSplittedList_shdl02.conf
Creating toCAF directory...
Setting parameters in shell script template
Using cdfsoft2 development
Done
Tarring toCAF directory...
concatenationSplittedList_shdl02.conf
conf.py
samConcCafSubmitter.sh
samConcReadSegmentConfig
samConcStore
samStoreCdfFile_v6
template/
template/samConcStore_template.tcl
Done
CafSubmit command:
CafSubmit --tarFile=toCAF.tgz --outLocation='delli@pcdf6.pd.infn.it:/spool/delli/conc_out/concatenation_segment_$.tgz' --procType=test --dhaccess=None --group=italy --email=delli@fnal.gov --start=1 --end=151 --farm=caf ./samConcCafSubmitter.sh $
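As a convenience, the three command lines of one iteration can be assembled programmatically. The sketch below is only an illustration: the function iteration_commands is hypothetical, the option names are those appearing in the transcript above, and the intermediate file names follow the naming pattern used there. It only builds the command lines; it does not run them and does not submit anything to CAF.

```python
import shlex
from typing import List


def iteration_commands(
    input_dataset: str, output_dataset: str, location: str
) -> List[List[str]]:
    """Build the argument lists for one iteration (hypothetical helper)."""
    # Intermediate file names follow the pattern used in the transcript.
    file_list = "concatenationList_%s.txt" % input_dataset
    split_list = "concatenationSplittedList_%s.conf" % input_dataset
    return [
        [
            "./samConcListGenerator",
            "--inputdataset=" + input_dataset,
            "--outputdataset=" + output_dataset,
            "--outputfile=" + file_list,
        ],
        [
            "./samConcDatasetSplitter",
            "--inputfilelist=" + file_list,
            "--outputsplittedfilelist=" + split_list,
            "--outputdataset=" + output_dataset,
        ],
        [
            "./samConcCafSubmitter",
            "--location=" + location,
            "--splittedfilelist=./" + split_list,
        ],
    ]


if __name__ == "__main__":
    commands = iteration_commands(
        "shdl02",
        "chdl02",
        "fcdfdata014.fnal.gov:/cdf/scratch/cdfdata/sam/temporary-location/",
    )
    for command in commands:
        # Print each command as a shell-safe string.
        print(" ".join(shlex.quote(arg) for arg in command))
```

Each returned list can be passed to subprocess.run from the skim_tools directory; the CafSubmit command printed by samConcCafSubmitter still has to be issued by the user.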
The usage of each script is reported below.
samConcDeclareList
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcDeclareList --help
Minimal declare of a plan file list
Usage: samConcDeclareList <options>
possible options are:
--inputfilelist - input file list to declare (file name, file size in Kb)
--help - this message
--test - test only
samConcListGenerator
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcListGenerator --help
Loops on the files of the input dataset and fills an ascii list of files
to be concatenated looking at the parent/child relation.
Usage: samConcListGenerator <options>
possible options are:
--inputdataset - input dataset to concatenate
--inpufilelist - inpu file list to concatenate
--outputdataset - new output dataset
--outputfile - the output file name
--help - this message
--test - test only
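The parent/child selection can be sketched as follows. This is an illustration only, not the actual implementation: it assumes, based on the "No children for file ..." messages in the log above, that an input file is written to the list when it has no child in the output dataset. The function files_to_concatenate and the children mapping are hypothetical; in the real tool this information comes from the SAM metadata.

```python
from typing import Dict, Iterable, List


def files_to_concatenate(
    input_files: Iterable[str], children: Dict[str, List[str]]
) -> List[str]:
    """Return the input files that have no child yet (hypothetical helper).

    children maps a file name to the names of its children in the output
    dataset; a missing key or an empty list means "no children".
    """
    return [name for name in input_files if not children.get(name)]


if __name__ == "__main__":
    inputs = ["s2025b96.014bhdl0", "s2025c1e.05a0hdl0", "s2025b54.00d7hdl0"]
    # Hypothetical state: one file was already concatenated earlier.
    known_children = {"s2025c1e.05a0hdl0": ["some_concatenated_file"]}
    for name in files_to_concatenate(inputs, known_children):
        print(name)
```

With this rule, a file concatenated in an earlier iteration is no longer selected, which is what makes the list shrink from one iteration to the next.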
samConcDatasetSplitter
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcDatasetSplitter --help
Loops on the files of the input files list, do ordering by run number looking the metadata
and create N SAM dataset containing files to be concatenated
Info are witten in the output splitted file list
Usage: samConcDatasetSplitter <options>
possible options are:
--inputfilelist - input file list
--outputdataset - new output dataset
--outputfile - the output file name
--ulimit - file size of the output files
--createsplitteddatasets - create also a set of splitted dataset for SAM access in concatenation
--runnumberordering - do order by run number
--usemergesubmit - use mergeSubmit.py
--help - this message
--test - test only
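The ordering and splitting step can be sketched in Python as below. The greedy packing strategy is an assumption, not the tool's documented algorithm: files are sorted by run number and appended to the current segment until adding the next one would exceed the size limit. The function split_into_segments is hypothetical; the input tuples mirror the "Plan file table" printed above (run number, file name, size, all as strings) and the result mirrors the "Splitted file table" (segment number mapped to a list of (file name, size) pairs).

```python
from typing import Dict, List, Tuple

PlanEntry = Tuple[str, str, str]  # (run number, file name, size)
Segment = List[Tuple[str, int]]   # [(file name, size), ...]


def split_into_segments(
    plan: List[PlanEntry], ulimit: int, order_by_run: bool = True
) -> Dict[int, Segment]:
    """Greedily pack files into segments of at most ulimit (assumed rule).

    ulimit must be in the same units as the sizes in the plan table.
    A single file larger than ulimit gets a segment of its own.
    """
    entries = [(int(run), name, int(size)) for run, name, size in plan]
    if order_by_run:
        # Sort by run number, then by file name for a stable result.
        entries.sort(key=lambda entry: (entry[0], entry[1]))

    segments: Dict[int, Segment] = {}
    current: Segment = []
    current_size = 0
    for _run, name, size in entries:
        if current and current_size + size > ulimit:
            # Close the current segment; numbering starts at 1.
            segments[len(segments) + 1] = current
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        segments[len(segments) + 1] = current
    return segments
```

Segments are numbered from 1, consistent with the --start=1 argument in the CafSubmit command shown earlier, so that each CAF section can process one segment.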
samConcCafSubmitter
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcCafSubmitter --help
Produces a tgz file for concatenation porpose
to be submitted to CAF and the CafSubmit command
Usage: samConcCafSubmitter <options>
possible options are:
--splittedfilelist - input splitted file list
--location - location of files to be concatenated (via rootd)
--help - this message
--test - test only
samConcStore
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcStore --help
Perform concatenation a list of input files
and store in sam merged file using samStoreCdfFile
Usage: samConcStore <options>
possible options are:
--help - this message
--files - list of input files (file1,file2,...,fileN)
--location - directory of the files
--outfile - full name of output file (before storing)
--station - sam station for SAM store
--host - hostname for the local SAM station
--station_buffer - path of the buffer on the SAM station
--dataset - CDF dataset assigned to the file
--from_worker_node - store from CAF Worker Node (otherwise from SAM station)
--conc_only - concatenate only, do not perform the store process
--test - test only, do not submit any process
samConcReadSegmentConfig
fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcReadSegmentConfig --help
Read segment config and set environ variable
Usage: samReadSegmentConfig <options>
possible options are:
--segment - segment to setenv to
--splittedfilelist - file to read
--help - this message
--test - test only