Filemerge


Filemerge is a utility for merging a large number of small HDFS files into a smaller number of large files. Filemerge is intended for use by Hadoop operations engineers and map-reduce application developers. The structure of the code is simple. The actual merging is performed by a Pig script created at run time using user-supplied parameters.

FileMerge.net provides utilities to merge files. You can merge CSV files into one file. You can combine an HTML file and its images into one file.

To merge spreadsheets from MS Excel, LibreOffice Calc, Google Sheets, etc., you must first save the sheet as a CSV (comma-separated values) file. In Excel and programs like it, click 'Save As' and save as a '.csv' file. In Google Sheets, click 'File > Download > Comma-separated values (.csv, current sheet)'.

Once you have the CSV files, click 'Choose File' to upload them. Click the checkbox next to the files you want to merge and then click the 'Start Merge' button.

First, choose the files that you want to merge together. Once that is done, you can select the columns that you want in your new file.

Filemerge: Tool to merge small HDFS files


Filemerge is a utility for merging a large number of small HDFS files into a smaller number of large files. Filemerge is intended for use by Hadoop operations engineers and map-reduce application developers.

The structure of the code is simple. The actual merging is performed by a Pig script created at run time using user-supplied parameters. These parameters control the set of files to merge. The utility consists of a single file, filemerge.py, that takes the input parameters and invokes the created Pig script. As such, the pig command needs to be available and on the PATH of the runtime user. The user specifies the input path, output path, topic, and the files to be merged, either in a year/month/day format, as a specific HDFS directory, or as a list of HDFS directories in a file.

Installation and testing

Because the application code is small and self-contained, installation requires simply cloning the repository.

Note that filemerge itself does not have any dependencies besides the pig command line. However, running the test suite locally requires installation of the test discovery and mocking packages. These dependencies are listed in filemerge/requirements.txt and can be installed as follows.

Finally, installation can be verified by running the test suite locally.

Developers can check test coverage by running

from the project top-level directory.

Running the script

The script API

The full API of the script is available on the command line by typing

The help message is reproduced below for reference.

The arguments outside the square brackets are required and those in the square brackets are optional, but a minimum set of these arguments is needed to compute the set of directories to be merged. The acceptable option groups are the following:

  • Group 1
    • year (-y)
    • year (-y), month (-m)
    • year (-y), month (-m), day (-d)
  • Group 2
    • HDFS directory (-D)
  • Group 3
    • file with a list of HDFS directories (-f)
  • Group 4
    • window with a start date (-w); files for all days from the start date minus the window up to the start date will be merged
  • Group 5
    • lookback with a start date (-l); files for a single day, lookback days before the start date, will be merged

These option groups are designed to enable merging at the directory, day, month, or year level. The -f option offers the ability to merge non-contiguous directory blocks. The -w and -l options allow merging of directories at periodic intervals using a sliding window.
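The grouping above can be sketched as a small lookup. This is only an illustration, not the validation code in filemerge.py: the helper name option_group is hypothetical, and it assumes that the groups are mutually exclusive and that only the option letters listed above matter.

```python
# Hypothetical helper (not part of filemerge): map the option letters a user
# supplied to one of the option groups listed above.
# Assumption: options from different groups are not combined.
ACCEPTED_GROUPS = {
    1: [{"y"}, {"y", "m"}, {"y", "m", "d"}],
    2: [{"D"}],
    3: [{"f"}],
    4: [{"w"}],
    5: [{"l"}],
}


def option_group(opts):
    """Return the group number (1-5) for a set of option letters, else None."""
    supplied = frozenset(opts)
    for number, combos in ACCEPTED_GROUPS.items():
        if any(supplied == frozenset(combo) for combo in combos):
            return number
    return None
```

Under these assumptions, option_group({"y", "m"}) falls into Group 1, while a month without a year, or a mix such as {"y", "D"}, matches no group.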

One can further enhance the flexibility of these options by wrapping the python call in a shell script and providing a custom list of directories, non-contiguous months, chunking large directory lists into smaller parts, etc.

Why all the flexibility?

The filemerge tool is written with operations and map-reduce application developers in mind. Operations teams will need periodic merges based on the retention policy and will typically use the tool with the -y, -m, and -d options. Map-reduce application developers might need to merge single directories or random directory groups and will use the -D and -f options.

Basic usage: Merging all files in a directory

The most common usage pattern for filemerge is to merge all files in a directory and produce one output file (in a different directory). To merge files under a specific directory, provide the base path using the -i option and the final directory name using the -D option. In the following invocation, /path/to/clickstream is the base HDFS path and jan2016 is the subdirectory that contains the files to be merged (in this case, for January 2016). In other words, the full path to the files that will be merged is: /path/to/clickstream/jan2016
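As a rough sketch of how the -i and -D values combine, the snippet below builds an argument list and the resulting input path without running anything. The function names are hypothetical, and extra_args is only a placeholder for the remaining required arguments (such as output path and topic), whose flags are not reproduced in this section.

```python
import posixpath


def basic_merge_command(base_path, subdirectory, extra_args=()):
    """Build (but do not run) an argument list for a single-directory merge."""
    argv = ["python", "filemerge.py", "-i", base_path, "-D", subdirectory]
    # Placeholder for arguments not shown here (output path, topic, ...).
    argv.extend(extra_args)
    return argv


def merged_input_path(base_path, subdirectory):
    """Full HDFS path whose files would be merged (POSIX-style join)."""
    return posixpath.join(base_path, subdirectory)


command = basic_merge_command("/path/to/clickstream", "jan2016")
full_path = merged_input_path("/path/to/clickstream", "jan2016")
```

Here full_path is /path/to/clickstream/jan2016, matching the path described above.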

Example invocation for a full month merge

The following command invokes the script for merging February 2015 data of the ‘clickstream’ directory in HDFS. This is the raw call to the filemerge python script and will initiate 28 map-reduce jobs.

Example invocation for a full year merge

Simply omit the month and day options and the merge will be performed for the full year. The following command invokes the script for merging the entire 2015 data of the ‘clickstream’ directory with a 1 day chunk size. This will initiate 365 map-reduce jobs.

Note that detecting files in a time window (e.g. a certain month or a year) requires filemerge to assume certain directory naming conventions. This convention is specified in filemerge/templates.py and can be user-defined.
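The job counts quoted above (28 for February 2015, 365 for all of 2015) follow from one merge per day. The sketch below enumerates those days and renders a directory name for each. The template d_%Y%m%d is an assumption inferred from directory names such as d_20150225 that appear elsewhere in this description; the actual convention lives in filemerge/templates.py and may differ. Both function names are hypothetical.

```python
import calendar
from datetime import date, timedelta

# Assumed naming convention; the real one is defined in filemerge/templates.py.
DAY_TEMPLATE = "d_%Y%m%d"


def days_to_merge(year, month=None, day=None):
    """List every calendar day covered by a year, a month, or a single day."""
    if month is None:
        start, end = date(year, 1, 1), date(year, 12, 31)
    elif day is None:
        last_day = calendar.monthrange(year, month)[1]
        start, end = date(year, month, 1), date(year, month, last_day)
    else:
        start = end = date(year, month, day)
    count = (end - start).days + 1
    return [start + timedelta(days=offset) for offset in range(count)]


def directory_names(days, template=DAY_TEMPLATE):
    """Render one directory name per day using the assumed template."""
    return [d.strftime(template) for d in days]
```

With a 1 day chunk size, len(days_to_merge(2015, 2)) gives the number of jobs for February 2015 and len(days_to_merge(2015)) gives the number for the full year.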

Example invocation for a non-contiguous directory list

To merge files under unrelated non-contiguous directories, list all the final directory names in a file and pass the full file path to the -f option. In the invocation below, -i captures the common portion of the path to all the directories and the final directories are listed in the file.

Let's assume that /local/filesystem/path/to/directory_list.txt contains the following lines

In that case all files under /hdfs/path/to/clickstream/{d_20150225, d_20160309, d_20150728} will be merged. Note that they won't be merged into the same file. Rather, three different output directories, one for each directory listed in directory_list.txt, will be created.
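To make the expansion concrete, here is a minimal sketch of how a directory list could be combined with the common base path. It assumes one directory name per line and ignores blank lines; the function name is hypothetical and the real parsing in filemerge.py is not shown here.

```python
import posixpath


def directories_from_list(base_path, lines):
    """Join each non-empty line of a directory list onto the base path."""
    names = [line.strip() for line in lines]
    return [posixpath.join(base_path, name) for name in names if name]


# The three directory names mentioned above, as they might appear in the file.
listed = ["d_20150225\n", "d_20160309\n", "d_20150728\n"]
targets = directories_from_list("/hdfs/path/to/clickstream", listed)
```

Each entry in targets is merged on its own, which is why three separate output directories result. An open file object can be passed as lines, since iterating over it yields one line at a time.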

Example invocation for a sliding time window

The following invocation of the filemerge script will merge files in the clickstream directory for the last 20 days (not including today). The window is datetime aware.

Example invocation for a sliding window daily merge

The following invocation of the filemerge script will merge files in the clickstream topic for the day 20 days prior to today. The lookback is datetime aware.
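The date arithmetic behind the window and lookback options can be sketched with the standard datetime module. This is one reading of the descriptions above, not the tool's own code: it assumes the window covers the days before the start date and excludes the start date itself (consistent with "not including today"), and both function names are hypothetical.

```python
from datetime import date, timedelta


def window_days(start, window):
    """Days from (start - window) up to, but not including, start."""
    return [start - timedelta(days=n) for n in range(window, 0, -1)]


def lookback_day(start, lookback):
    """The single day that lies `lookback` days before start."""
    return start - timedelta(days=lookback)


start = date(2016, 3, 10)
last_twenty = window_days(start, 20)     # 20 consecutive days ending 2016-03-09
single_day = lookback_day(start, 20)     # one day, 20 days before the start
```

Because the subtraction is done on date objects, month boundaries and leap days are handled by the library rather than by hand; in this example the window crosses 29 February 2016.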

Multi-directory merge

For multi-directory merges, filemerge.py can be called from a script that provides the list of directories and the merge frequency. The following wrapper script shows how to merge 2015 files for a subset of directories. The script needs to be present in the same directory as the filemerge.py script.
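As an illustration of such a wrapper, written here in Python rather than shell, the sketch below builds one yearly invocation per directory without executing anything. It assumes that each directory is passed as its own -i base path; the directory name impressions is made up for the example, the function name is hypothetical, and extra_args stands in for required arguments (such as output path and topic) whose flags are not reproduced in this section.

```python
def yearly_merge_commands(base_path, directories, year, extra_args=()):
    """Build one filemerge.py argument list per directory for a given year."""
    commands = []
    for name in directories:
        # Assumption: each directory is supplied as a separate -i base path.
        argv = ["python", "filemerge.py", "-i", f"{base_path}/{name}", "-y", str(year)]
        argv.extend(extra_args)  # placeholder for arguments not shown here
        commands.append(argv)
    return commands


# "impressions" is a hypothetical second directory used only for illustration.
commands = yearly_merge_commands("/path/to", ["clickstream", "impressions"], 2015)
```

Each argument list could then be handed to subprocess.run(argv, check=True) from the directory that contains filemerge.py, so that a failed merge stops the loop.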

Merge for custom months

Merging for custom months is straightforward and is similar to the above looping logic. Once again, the following script needs to be located in the same directory as filemerge.py.
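A comparable sketch for non-contiguous months is shown below. It relies only on the -i, -y, and -m options named earlier; the function name is hypothetical, passing the month as a plain number is an assumption about the expected format, and extra_args again stands in for arguments not reproduced in this section.

```python
def monthly_merge_commands(input_path, year, months, extra_args=()):
    """Build one filemerge.py argument list per requested month."""
    commands = []
    for month in months:
        if not 1 <= month <= 12:
            raise ValueError(f"invalid month: {month}")
        # Assumption: -m accepts the month as a plain number such as 2 or 11.
        argv = ["python", "filemerge.py", "-i", input_path,
                "-y", str(year), "-m", str(month)]
        argv.extend(extra_args)  # placeholder for arguments not shown here
        commands.append(argv)
    return commands


# February, June, and November 2015: three separate invocations.
custom = monthly_merge_commands("/path/to/clickstream", 2015, [2, 6, 11])
```

Only the loop variable changes compared with a per-directory wrapper, which is the point of treating the directory as the unit of time.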

High-level pattern

The overarching pattern here is to realize that the unit of time for the merge logic is a directory. As long as this is noted, the actual logic can be customized in more ways than those shown above: simply write a wrapper shell script to create your variables and loop over them. These variables can be months, input directories, or output directories.

Release history

0.0.3

0.0.2

Download files

Files for filemerge, version 0.0.3

Filename: filemerge-0.0.3.tar.gz (11.8 kB)
File type: Source
Python version: None

Hashes for filemerge-0.0.3.tar.gz

Algorithm    Hash digest
SHA256       59d42958d48db5f76f2c6a5f64ebbe4a2437f50755a97e26f59da47d2b1145f0
MD5          db7b1fc7f43bfb1f22fd6a47f5847470
BLAKE2-256   d0744c86dae0ef8c0e7f9098eb621acff5b4817cede82b2ed2438383b34637fd