Filemerge: Tool to merge small HDFS files
Filemerge
Filemerge is a utility for merging a large number of small HDFS files into a smaller number of large files. Filemerge is intended for use by Hadoop operations engineers and map-reduce application developers.
The structure of the code is simple. The actual merging is performed by a Pig script created at run time using user-supplied parameters. These parameters control the set of files to merge. The utility consists of a single file, filemerge.py, that takes the input parameters and invokes the created Pig script. As such, the pig command needs to be available and in the path of the runtime user. The user specifies the input path, output path, topic, and the files to be merged, either in a year/month/day format, as a specific HDFS directory, or as a list of HDFS directories in a file.
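As a rough illustration of the design described above, the following sketch renders a small Pig script from user parameters with Python's string.Template. The template text, the render_pig_script name, the parameter names, and the output path are assumptions made for illustration; the template that filemerge actually uses may look different.

```python
from string import Template

# Hypothetical Pig template: load every file under one directory and
# store the rows again under another directory. This is NOT the template
# shipped with filemerge, only an example of run-time script generation.
PIG_TEMPLATE = Template(
    "data = LOAD '$input_dir' USING PigStorage();\n"
    "STORE data INTO '$output_dir' USING PigStorage();\n"
)


def render_pig_script(input_dir, output_dir):
    """Fill the template with the directories chosen by the user."""
    return PIG_TEMPLATE.substitute(input_dir=input_dir, output_dir=output_dir)


# The output directory below is a made-up example value.
script = render_pig_script("/path/to/clickstream/jan2016",
                           "/path/to/merged/jan2016")
print(script)
```

The sketch only covers how the script text could be produced. How many output files Pig writes depends on job settings that are not shown here.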
Installation and testing
Because the application code is small and self-contained, installation requires simply cloning the repository.
Note that filemerge itself does not have any dependencies besides the pig command-line tool. However, running the test suite locally requires installation of the test discovery and mocking packages. These dependencies are listed in filemerge/requirements.txt and can be installed as follows.
Finally, installation can be verified by running the test suite locally.
Developers can check test coverage by running
from the project top-level directory.
Running the script
The script API
The full API of the script is available on the command line by typing
The help message is reproduced below for reference.
The arguments outside the square brackets are required and those in the square brackets are optional, but a minimum set of these arguments is needed to compute the set of directories to be merged. The acceptable option groups are the following:
- Group 1
  - year (-y)
  - year (-y), month (-m)
  - year (-y), month (-m), day (-d)
- Group 2
  - HDFS directory (-D)
- Group 3
  - file with a list of HDFS directories (-f)
- Group 4
  - window with a start date (-w); files for all days between start date minus window to start date will be merged
- Group 5
  - lookback with a start date (-l); files for a single day lookback days before the start date will be merged
These option groups are designed to enable merging at the directory, day, month, or year level. The -f option offers the ability to merge non-contiguous directory blocks. The -w and -l options allow merging of directories at periodic intervals using a sliding window.
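The grouping rules above can be sketched as a small classifier. The function and parameter names are hypothetical, the start date that accompanies -w and -l is left out, and rejecting combinations of groups is an assumption rather than documented behaviour of filemerge.py.

```python
def option_group(year=None, month=None, day=None, directory=None,
                 dir_file=None, window=None, lookback=None):
    """Return the number (1 to 5) of the option group the arguments match.

    Raises ValueError when the arguments match no group or several groups.
    """
    # Inside group 1 a day is only meaningful with a month,
    # and a month only with a year.
    if day is not None and month is None:
        raise ValueError("day (-d) requires month (-m)")
    if month is not None and year is None:
        raise ValueError("month (-m) requires year (-y)")

    matched = []
    if year is not None:
        matched.append(1)   # -y, optionally with -m and -d
    if directory is not None:
        matched.append(2)   # -D
    if dir_file is not None:
        matched.append(3)   # -f
    if window is not None:
        matched.append(4)   # -w
    if lookback is not None:
        matched.append(5)   # -l

    if len(matched) != 1:
        raise ValueError("exactly one option group is expected")
    return matched[0]


print(option_group(year=2015, month=2))   # 1
print(option_group(directory="jan2016"))  # 2
```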
One can further enhance the flexibility of these options by wrapping the python call in a shell script and providing a custom list of directories, non-contiguous months, chunking large directory lists into smaller parts, etc.
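Chunking a large directory list, for instance, takes only a few lines. The sketch below is written in Python rather than shell, and the chunked helper is a hypothetical name; each resulting part could be written to its own list file and passed to -f.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


directories = ["d_20150225", "d_20160309", "d_20150728"]
for part in chunked(directories, 2):
    print(part)
```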
Why all the flexibility?
The filemerge tool is written with operations and map-reduce application developers in mind. Operations teams will need periodic merges based on the retention policy and will typically use the tool with the -y, -m, and -d options. Map-reduce application developers might need to merge single directories or arbitrary directory groups and will use the -D and -f options.
Basic usage: Merging all files in a directory
The most common usage pattern for filemerge is to merge all files in a directory and produce one output file (in a different directory). To merge files under a specific directory, provide the base path using the -i option and the final directory name using the -D option. In the following invocation, /path/to/clickstream is the base HDFS path and jan2016 is the subdirectory that contains the files to be merged (in this case, for January 2016). In other words, the full path to the files that will be merged is: /path/to/clickstream/jan2016
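How the two values combine can be shown in a few lines of Python. This is only a sketch: it covers the -i and -D flags named above and leaves out the other required arguments (such as the output path), because their flags are not shown in this section.

```python
import posixpath

base_path = "/path/to/clickstream"  # value given to -i
final_dir = "jan2016"               # value given to -D

# HDFS paths use forward slashes, so posixpath is used on every platform.
full_path = posixpath.join(base_path, final_dir)
print(full_path)  # /path/to/clickstream/jan2016

# Partial argument list; the remaining required arguments are omitted.
partial_args = ["-i", base_path, "-D", final_dir]
print(partial_args)
```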
Example invocation for a full month merge
The following command invokes the script to merge the February 2015 data of the ‘clickstream’ directory in HDFS. This is the raw call to the filemerge python script and will initiate 28 map-reduce jobs, one per day of the month.
Example invocation for a full year merge
Simply omit the month and day options and the merge will be performed for the full year. The following command invokes the script to merge the entire 2015 data of the ‘clickstream’ directory with a 1 day chunk size. This will initiate 365 map-reduce jobs.
Note that detecting files in a time window (e.g. a certain month or a year) requires filemerge to assume certain directory naming conventions. This convention is specified in filemerge/templates.py and can be user-defined.
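The job counts quoted above (28 for February 2015, 365 for all of 2015) follow from one job per day. The sketch below enumerates those days and formats a directory name for each. The d_YYYYMMDD pattern is an assumption modelled on directory names that appear later in this description; the real convention is whatever filemerge/templates.py defines.

```python
import calendar
from datetime import date, timedelta

# Assumed naming convention, for illustration only.
DIR_TEMPLATE = "d_{:%Y%m%d}"


def days_in_month(year, month):
    """Return every date in the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return [date(year, month, day) for day in range(1, last_day + 1)]


def days_in_year(year):
    """Return every date in the given year."""
    count = 366 if calendar.isleap(year) else 365
    first = date(year, 1, 1)
    return [first + timedelta(days=offset) for offset in range(count)]


february = [DIR_TEMPLATE.format(day) for day in days_in_month(2015, 2)]
print(len(february), february[0], february[-1])  # 28 d_20150201 d_20150228
print(len(days_in_year(2015)))                   # 365
```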
Example invocation for a non-contiguous directory list
To merge files under unrelated, non-contiguous directories, list all the final directory names in a file and pass the full file path to the -f option. In the invocation below, -i captures the common portion of the path to all the directories, and the final directories are listed in the file.
Let's assume that /local/filesystem/path/to/directory_list.txt contains the following lines
In that case all files under /hdfs/path/to/clickstream/{d_20150225, d_20160309, d_20150728} will be merged. Note that they won't be merged into the same file. Rather, three different output directories, one for each directory listed in directory_list.txt, will be created.
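The relationship between the base path and the list file can be sketched as follows. The read_directory_list helper is a hypothetical name, and a temporary file stands in for directory_list.txt so the example runs anywhere; filemerge's own parsing of the file may differ.

```python
import os
import posixpath
import tempfile


def read_directory_list(list_path):
    """Return the non-empty lines of the list file, stripped of whitespace."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Temporary stand-in for directory_list.txt, holding the three names above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("d_20150225\nd_20160309\nd_20150728\n")
    list_path = handle.name

base_path = "/hdfs/path/to/clickstream"  # common portion, given to -i
inputs = [posixpath.join(base_path, name)
          for name in read_directory_list(list_path)]
os.remove(list_path)

# One input directory, and therefore one separate merge, per listed name.
for path in inputs:
    print(path)
```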
Example invocation for a sliding time window
The following invocation of the filemerge script merges files in the clickstream directory for the last 20 days (not including today). The window is datetime aware.
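The date arithmetic behind such a window can be sketched in Python. The window_days name is hypothetical, and excluding the start date itself is an assumption based on the "not including today" wording above.

```python
from datetime import date, timedelta


def window_days(start, window):
    """Return the `window` days before `start`, oldest first, excluding `start`."""
    return [start - timedelta(days=offset) for offset in range(window, 0, -1)]


# A fixed start date keeps the example reproducible; a real run would
# typically start from date.today().
days = window_days(date(2016, 3, 9), 20)
print(days[0], days[-1], len(days))  # 2016-02-18 2016-03-08 20
```

Because the subtraction works on real calendar dates, the window crosses the month boundary (and the leap day of 2016) without special handling.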
Example invocation for a sliding window daily merge
The following invocation of the filemerge script merges files in the clickstream topic for the day 20 days prior to today. The lookback is datetime aware.
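In contrast to a window, a lookback selects a single day. A minimal sketch of that calculation, using the hypothetical name lookback_day:

```python
from datetime import date, timedelta


def lookback_day(start, lookback):
    """Return the single day that lies `lookback` days before `start`."""
    return start - timedelta(days=lookback)


# A fixed start date keeps the example reproducible.
print(lookback_day(date(2015, 3, 10), 20))  # 2015-02-18
```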
Multi-directory merge
For multi-directory merges, filemerge.py can be called from a script that provides the list of directories and the merge frequency. The following wrapper script shows how to merge 2015 files for a subset of directories. The script needs to be present in the same directory as the filemerge.py script.
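A wrapper along these lines could also be written in Python. The sketch below only builds and prints the command lines. The base path and the second directory name are made-up values, only the -i and -y flags named in this description are used, and the other required arguments are omitted because their flags are not shown here.

```python
BASE_PATH = "/path/to"                      # hypothetical common prefix
DIRECTORIES = ["clickstream", "searchlog"]  # hypothetical subset to merge
YEAR = 2015


def build_commands(base_path, directories, year):
    """Return one partial filemerge.py command line per directory."""
    commands = []
    for name in directories:
        commands.append([
            "python", "filemerge.py",
            "-i", "{}/{}".format(base_path, name),
            "-y", str(year),
        ])
    return commands


for command in build_commands(BASE_PATH, DIRECTORIES, YEAR):
    # Dry run: print instead of executing.
    print(" ".join(command))
```

Once the missing arguments are filled in, each list could be handed to subprocess.run(command, check=True) to run the merges one after another.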
Merge for custom months
Merging for custom months is straightforward and is similar to the above looping logic. Once again, the following script needs to be located in the same directory as filemerge.py.
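The same looping idea applied to a custom set of months might look like the sketch below. The chosen months are arbitrary, the format that -m expects (for example, whether months are zero-padded) is an assumption, and the command lines are again partial because only the flags named in this description are used.

```python
BASE_PATH = "/path/to/clickstream"
YEAR = 2015
MONTHS = [1, 4, 7, 10]  # arbitrary non-contiguous months


def build_month_commands(base_path, year, months):
    """Return one partial filemerge.py command line per month."""
    return [
        ["python", "filemerge.py",
         "-i", base_path, "-y", str(year), "-m", str(month)]
        for month in months
    ]


for command in build_month_commands(BASE_PATH, YEAR, MONTHS):
    # Dry run: print instead of executing.
    print(" ".join(command))
```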
High-level pattern
The overarching pattern here is to realize that the unit of time for the merge logic is a directory. As long as this is noted, the actual logic can be customized in more ways than those shown above: simply write a wrapper shell script to create your variables and loop over them. These variables can be months, input directories, or output directories.