Concatenation Tool

1. Why a Concatenation Tool

Skimming a large hadronic dataset (or a Monte Carlo production) generates a huge number of files of hundreds of MB each. For example, skimming the xbhdod dataset produces ~42000 "small" files.
The Concatenation Tool provides an easy way to concatenate these files and store the merged files on tape using SAM (Sequential Access via Metadata), a new data handling system for CDF.


2. Architecture of the Concatenation Tool

The tool is made up of a set of Python scripts, each developed to carry out a specific task:

samConcDeclareList
samConcListGenerator
samConcDatasetSplitter
samConcCafSubmitter
samConcStore
samConcReadSegmentConfig

How it works:

The tool is meant to be run iteratively until the list of files to be concatenated is empty. The figure shows the architecture of the tool in detail.
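
The iterative scheme described above can be sketched as a small driver loop. This is only an illustration under stated assumptions: concatenate_until_empty, build_list and run_iteration are hypothetical names that are not part of the tool. In practice the steps are the scripts listed above, run by hand, with a CAF job in between.

```python
from typing import Callable, List


def concatenate_until_empty(
    build_list: Callable[[], List[str]],
    run_iteration: Callable[[List[str]], None],
    max_iterations: int = 100,
) -> int:
    """Repeat the concatenation cycle until no files are left to concatenate.

    build_list    -- hypothetical stand-in for samConcListGenerator: returns
                     the files that still have to be concatenated
    run_iteration -- hypothetical stand-in for splitting, CAF submission and
                     waiting for the CAF job to finish
    Returns the number of iterations that were run.
    """
    for iteration in range(max_iterations):
        pending = build_list()
        if not pending:
            # Nothing left to concatenate: the procedure is complete.
            return iteration
        run_iteration(pending)
    raise RuntimeError(
        "file list still not empty after %d iterations" % max_iterations
    )
```

The max_iterations guard is a design choice of this sketch, so that a list that never empties is reported instead of looping forever.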


3. Where is the code

The code is publicly available in CVS under PadovaSam/skim_tools.

fcdflnx2.fnal.gov:/cdf/home/delli > setup cdfsoft2 5.3.4 

fcdflnx2.fnal.gov:/cdf/home/delli > cvs checkout PadovaSam/skim_tools
cvs checkout: Updating PadovaSam/skim_tools
U PadovaSam/skim_tools/cleanDurableLocation
U PadovaSam/skim_tools/conf.py
U PadovaSam/skim_tools/samConcCafSubmitter
U PadovaSam/skim_tools/samConcDatasetSplitter
U PadovaSam/skim_tools/samConcDeclareList
U PadovaSam/skim_tools/samConcListGenerator
U PadovaSam/skim_tools/samConcReadSegmentConfig
U PadovaSam/skim_tools/samConcSequentialSubmitter
U PadovaSam/skim_tools/samConcStore
U PadovaSam/skim_tools/samDatasetConcate
U PadovaSam/skim_tools/samStoreCdfFile_v6
cvs checkout: Updating PadovaSam/skim_tools/scripts
U PadovaSam/skim_tools/scripts/mergeSubmit.py
U PadovaSam/skim_tools/scripts/testIfChildIsOnTape.py
U PadovaSam/skim_tools/scripts/testIfSkimmedFilesAreOK.py
cvs checkout: Updating PadovaSam/skim_tools/template
U PadovaSam/skim_tools/template/samConcCafCommand_template.txt
U PadovaSam/skim_tools/template/samConcCafSubmitter_template.csh
U PadovaSam/skim_tools/template/samConcStore_template.tcl
    

4. An example of Concatenation Tool usage

The commands below make up a single iteration. The user must repeat this sequence of commands until the list of files to be concatenated is empty. Each new iteration must be started only after the CAF job of the previous one has finished.

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcListGenerator --inputdataset=shdl02 --outputdataset=chdl02 --outputfile=concatenationList_shdl02.txt

Opening new file...Done
Getting files in output dataset...
[]
Done
Getting files in input dataset...
Done
Looping on files list...
Checking s2025b96.014bhdl0 children...
No children for file s2025b96.014bhdl0
Writing file s2025b96.014bhdl0 input files list...Done
Checking s2025c1e.05a0hdl0 children...
No children for file s2025c1e.05a0hdl0
Writing file s2025c1e.05a0hdl0 input files list...Done
Checking s2025b54.00d7hdl0 children...
No children for file s2025b54.00d7hdl0
Writing file s2025b54.00d7hdl0 input files list...Done
...
Checking s2027cf8.03c9hdl0 children...
No children for file s2027cf8.03c9hdl0
Writing file s2027cf8.03c9hdl0 input files list...Done
Done
Closing file...Done
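
The selection that the log above reports ("No children for file ...") can be sketched as follows. This is a minimal illustration, not the code of samConcListGenerator: the SAM metadata queries are replaced by a plain dictionary, the function name files_to_concatenate is hypothetical, and it is an assumption that a file counts as already concatenated when one of its children belongs to the output dataset.

```python
from typing import Dict, Iterable, List, Set


def files_to_concatenate(
    input_files: Iterable[str],
    children: Dict[str, List[str]],
    output_files: Set[str],
) -> List[str]:
    """Return the input files that have no child in the output dataset.

    children     -- maps a file name to the names of its child files
                    (stand-in for a SAM parent/child metadata query)
    output_files -- names of the files already in the output dataset
    """
    selected = []
    for name in input_files:
        kids = children.get(name, [])
        # Keep the file only if none of its children is in the output dataset.
        if not any(kid in output_files for kid in kids):
            selected.append(name)
    return selected
```

With an empty output dataset, as in the first iteration shown above, every input file is selected.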

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcDatasetSplitter --inputfilelist=concatenationList_shdl02.txt --outputsplittedfilelist=concatenationSplittedList_shdl02.conf --outputdataset=chdl02

Getting information from file list and metadata
Processing 100/25577
Processing 200/25577
...
Processing 25400/25577
Processing 25500/25577
Done
Ordering files by runnumber
Plan file table:
[('138815', 's2021e3f.0047hdl0', '18717'),... ,('155795', 's2026093.01efhdl0', '6362'), ('155795', 's2026093.02b9hdl0', '4490')]
Done
Splitting files into segments
Processing 100/1000
Processing 200/1000
...
Processing 25400/25577
Processing 25500/25577
Splitted file table:
{ ... 1: [('s2021e3f.0047hdl0', 18717), ('s2021e44.0071hdl0', 20063), ('s2021e4f.0082hdl0', 21477), ('s2021e56.00d8hdl0', 19350), ('s2021e57.0066hdl0', 18694), ('s2021e6b.0028hdl0', 19487), ('s2021e6c.00dchdl0', 20275), ('s2021e6f.0033hdl0', 21336), ('s2021e70.009dhdl0', 1797), ('s2021e70.00cahdl0', 20122), ('s2021fc6.002ehdl0', 19178), ('s2021fc6.0221hdl0', 18775), ('s2021ff5.0006hdl0', 20842), ('s2021ff5.0162hdl0', 18596), ('s202200d.015fhdl0', 22114), ('s202200d.0255hdl0', 22437), ('s202200e.0085hdl0', 2198), ('s202200e.00b3hdl0', 20631), ('s202201f.0073hdl0', 18538), ('s2022022.006fhdl0', 21722), ('s2022024.009bhdl0', 20370), ('s2022048.0065hdl0', 20076), ('s2022049.004chdl0', 22482), ('s202204a.0125hdl0', 30670), ('s202220c.005ehdl0', 3420), ('s2022273.0064hdl0', 36856), ('s202235f.00c4hdl0', 37568), ('s2022375.0033hdl0', 35809), ('s20223b9.0007hdl0', 43603), ('s20223d1.007dhdl0', 30502), ('s20225b1.0001hdl0', 38827), ('s2022633.00c4hdl0', 10266), ('s20226ef.0096hdl0', 36746), ('s2022729.0033hdl0', 37949), ('s20228e3.0021hdl0', 2749), ('s20228e6.00d8hdl0', 40729), ('s2022904.00b6hdl0', 40358), ('s202291d.0148hdl0', 37407), ('s202291d.02a8hdl0', 39643), ('s2022932.0002hdl0', 38474), ('s2022935.0001hdl0', 41046)]}
Done
Writing splitted file list...
Done
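
The two stages reported in the log above (ordering by run number, then splitting into segments) can be sketched as a greedy packing. This is an illustration only: the actual algorithm of samConcDatasetSplitter is not shown in this document, the function name split_into_segments and the limit value are hypothetical, and it is assumed that the third field of each plan table entry is the file size in Kb and that segments are numbered from 1.

```python
from typing import Dict, List, Tuple

# (run number, file name, file size), all as strings, as in the plan file table
PlanEntry = Tuple[str, str, str]


def split_into_segments(
    plan: List[PlanEntry], limit_kb: int
) -> Dict[int, List[Tuple[str, int]]]:
    """Order files by run number and pack them greedily into segments."""
    ordered = sorted(plan, key=lambda entry: (int(entry[0]), entry[1]))
    segments: Dict[int, List[Tuple[str, int]]] = {}
    current = 1
    used = 0
    for _run, name, size_text in ordered:
        size = int(size_text)
        # Open a new segment when this file would exceed the size limit.
        if used > 0 and used + size > limit_kb:
            current += 1
            used = 0
        segments.setdefault(current, []).append((name, size))
        used += size
    return segments


plan = [
    ("155795", "s2026093.02b9hdl0", "4490"),
    ("138815", "s2021e3f.0047hdl0", "18717"),
    ("155795", "s2026093.01efhdl0", "6362"),
]
print(split_into_segments(plan, limit_kb=20000))
# {1: [('s2021e3f.0047hdl0', 18717)],
#  2: [('s2026093.01efhdl0', 6362), ('s2026093.02b9hdl0', 4490)]}
```

A file larger than the limit still gets a segment of its own, so no file is ever dropped.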

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcCafSubmitter --location=fcdfdata014.fnal.gov:/cdf/scratch/cdfdata/sam/temporary-location/ --splittedfilelist=./concatenationSplittedList_shdl02.conf
Creating toCAF directory...
Setting parameters in shell script template
Using cdfsoft2 development
Done
Tarring toCAF directory...
concatenationSplittedList_shdl02.conf
conf.py
samConcCafSubmitter.sh
samConcReadSegmentConfig
samConcStore
samStoreCdfFile_v6
template/
template/samConcStore_template.tcl
Done
CafSubmit command:

CafSubmit --tarFile=toCAF.tgz --outLocation='delli@pcdf6.pd.infn.it:/spool/delli/conc_out/concatenation_segment_$.tgz' --procType=test --dhaccess=None --group=italy --email=delli@fnal.gov --start=1 --end=151 --farm=caf ./samConcCafSubmitter.sh $
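
In the CafSubmit command above, the character $ appears both in --outLocation and as the argument of samConcCafSubmitter.sh, together with --start=1 --end=151. Assuming that CAF replaces $ with the segment number of each job segment (an assumption, not stated in this document), the resulting output names can be sketched as follows. The function name expand_segments is hypothetical.

```python
from typing import List


def expand_segments(template: str, start: int, end: int) -> List[str]:
    """Replace the '$' placeholder with each segment number, start to end inclusive."""
    return [
        template.replace("$", str(segment))
        for segment in range(start, end + 1)
    ]


names = expand_segments("concatenation_segment_$.tgz", 1, 151)
print(len(names))   # 151
print(names[0])     # concatenation_segment_1.tgz
print(names[-1])    # concatenation_segment_151.tgz
```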

    

5. Reference Manual

The usage of each script is reported below.


samConcDeclareList

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcDeclareList --help

 Minimal declare of a plan file list

Usage: samConcDeclareList <options>
 possible options are:
  --inputfilelist       - input file list to declare (file name, file size in Kb)
  --help                - this message
  --test                - test only
    

samConcListGenerator

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcListGenerator --help

Loops on the files of the input dataset and fills an ascii list of files
to be concatenated looking at the parent/child relation.

Usage: samConcListGenerator <options>
 possible options are:
  --inputdataset        - input dataset to concatenate
  --inpufilelist        - inpu file list to concatenate
  --outputdataset       - new output dataset
  --outputfile          - the output file name
  --help                - this message
  --test                - test only
    

samConcDatasetSplitter

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcDatasetSplitter --help

 Loops on the files of the input files list, do ordering by run number looking the metadata
 and create N SAM dataset containing files to be concatenated
 Info are witten in the output splitted file list

Usage: samConcDatasetSplitter <options>
 possible options are:
  --inputfilelist               - input file list
  --outputdataset               - new output dataset
  --outputfile                  - the output file name
  --ulimit                      - file size of the output files
  --createsplitteddatasets      - create also a set of splitted dataset for SAM access in concatenation
  --runnumberordering           - do order by run number
  --usemergesubmit              - use mergeSubmit.py
  --help                        - this message
  --test                        - test only
    

samConcCafSubmitter

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcCafSubmitter --help

Produces a tgz file for concatenation porpose
to be submitted to CAF and the CafSubmit command

Usage: samConcCafSubmitter <options>
 possible options are:
  --splittedfilelist    - input splitted file list
  --location            - location of files to be concatenated (via rootd)
  --help                - this message
  --test                - test only
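
The "tgz file" mentioned in the help text above is a gzip-compressed tar archive. A minimal sketch of how such an archive could be produced with the Python standard library is given here. It is not the code of samConcCafSubmitter, whose tarring step is not shown in this document, and the function name make_caf_tarball is hypothetical.

```python
import os
import tarfile
from typing import List


def make_caf_tarball(directory: str, tarball: str) -> List[str]:
    """Pack the contents of `directory` into a gzip-compressed tar file.

    Members are stored relative to `directory`, so the archive unpacks
    without a leading directory name. The tarball must be written outside
    `directory`. Returns the sorted member names read back from the archive.
    """
    with tarfile.open(tarball, "w:gz") as archive:
        for entry in sorted(os.listdir(directory)):
            # Directories are added recursively by tarfile.
            archive.add(os.path.join(directory, entry), arcname=entry)
    with tarfile.open(tarball, "r:gz") as archive:
        return sorted(archive.getnames())
```

For example, make_caf_tarball("toCAF", "toCAF.tgz") would pack a toCAF directory like the one created in the usage example into toCAF.tgz.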
    

samConcStore

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcStore --help

Perform concatenation a list of input files
and store in sam merged file using samStoreCdfFile

Usage: samConcStore <options>
 possible options are:
  --help                - this message
  --files               - list of input files (file1,file2,...,fileN)
  --location            - directory of the files
  --outfile             - full name of output file (before storing)
  --station             - sam station for SAM store
  --host                - hostname for the local SAM station
  --station_buffer      - path of the buffer on the SAM station
  --dataset             - CDF dataset assigned to the file
  --from_worker_node    - store from CAF Worker Node (otherwise from SAM station)
  --conc_only           - concatenate only, do not perform the store process
  --test                - test only, do not submit any process
    

samConcReadSegmentConfig

fcdflnx2.fnal.gov:/cdf/home/delli/PadovaSam/skim_tools > ./samConcReadSegmentConfig --help

 Read segment config and set environ variable

Usage: samReadSegmentConfig <options>
 possible options are:
  --segment             - segment to setenv to
  --splittedfilelist    - file to read
  --help                - this message
  --test                - test only
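
What "read segment config and set environ variable" might amount to can be sketched as follows. Everything here is an assumption made for illustration: the on-disk format of the splitted file list is not documented here, so the sketch starts from an in-memory table shaped like the "Splitted file table" printed by samConcDatasetSplitter, and both the function name export_segment and the variable name CONC_FILES are hypothetical.

```python
import os
from typing import Dict, List, Tuple

# segment number -> list of (file name, file size)
SegmentTable = Dict[int, List[Tuple[str, int]]]


def export_segment(
    table: SegmentTable, segment: int, variable: str = "CONC_FILES"
) -> str:
    """Put the file names of one segment into an environment variable.

    The value is a comma-separated list, the same shape that the --files
    option of samConcStore describes (file1,file2,...,fileN).
    """
    if segment not in table:
        raise KeyError("segment %d not found in the splitted file list" % segment)
    value = ",".join(name for name, _size in table[segment])
    os.environ[variable] = value
    return value
```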
    


Francesco Delli Paoli
Last modified: Tue Dec 14 14:57:19 CST 2004