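On my HDFS I have a bunch of gzip files that I want to decompress to their normal format. Is there an API for doing this? Or how could I write a function to do it?
I don't want to use any command-line tools; instead, I want to accomplish this by writing Java code.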
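You need a CompressionCodec to decompress the file. The implementation for gzip is GzipCodec. You get a CompressionInputStream from the codec and write the result out with simple IO. Something like this, say you have a file file.gz: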
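    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    // path of file
    String uri = "/uri/to/file.gz";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path inputPath = new Path(uri);

    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // the correct codec will be discovered by the extension of the file
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
        System.err.println("No codec found for " + uri);
        System.exit(1);
    }

    // remove the .gz extension
    String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

    InputStream is = codec.createInputStream(fs.open(inputPath));
    OutputStream out = fs.create(new Path(outputUri));
    // copy the decompressed bytes; the final "true" closes both streams
    IOUtils.copyBytes(is, out, conf, true);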
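The decompress-and-copy step can be tried without a cluster. The following is only a local sketch, assuming plain `java.util.zip` streams as a stand-in for the Hadoop codec streams (gzip is a standard format, so the byte-level behavior should match); the class name `LocalGunzipSketch` and its helper methods are hypothetical and not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical local stand-in: no HDFS or Hadoop classes are involved.
class LocalGunzipSketch {

    // Copies everything from in to out; plays the role of IOUtils.copyBytes.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Compresses raw bytes into gzip format (only needed to build test input).
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(raw);
        }
        return sink.toByteArray();
    }

    // Wraps the compressed source in a decompressing stream and copies it out,
    // closing the stream via try-with-resources.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            copy(in, sink);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello from a gzip file".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

In the HDFS version, `fs.open(inputPath)` takes the place of the `ByteArrayInputStream` and `fs.create(...)` takes the place of the `ByteArrayOutputStream`.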
UPDATE
If you need to get all the files in a directory, you should get the FileStatus objects, something like:
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] statuses = fs.listStatus(new Path("hdfs/path/to/dir"));
and then loop:
    for (FileStatus status : statuses) {
        CompressionCodec codec = factory.getCodec(status.getPath());
        ...
        InputStream is = codec.createInputStream(fs.open(status.getPath()));
        ...
    }
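The loop leaves out two details: skipping entries that have no matching codec, and deriving each output name. As a rough local analogue only, the sketch below uses `java.nio.file` and `java.util.zip` instead of `FileSystem` and `CompressionCodec`, so it does not touch HDFS; the class `LocalDirGunzipSketch` and the method `gunzipAll` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical local stand-in for the HDFS directory loop.
class LocalDirGunzipSketch {

    private static final String SUFFIX = ".gz";

    // Decompresses every *.gz file directly inside dir; returns the files written.
    static List<Path> gunzipAll(Path dir) throws IOException {
        List<Path> written = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path input : entries) {
                String name = input.getFileName().toString();
                // Analogue of the codec == null check: skip non-gzip entries.
                boolean isGzip = Files.isRegularFile(input)
                        && name.endsWith(SUFFIX)
                        && name.length() > SUFFIX.length();
                if (!isGzip) {
                    continue;
                }
                // Analogue of removeSuffix: data.txt.gz becomes data.txt.
                Path output = input.resolveSibling(
                        name.substring(0, name.length() - SUFFIX.length()));
                // The stream is closed automatically, even if the copy fails.
                try (InputStream in = new GZIPInputStream(Files.newInputStream(input))) {
                    Files.copy(in, output, StandardCopyOption.REPLACE_EXISTING);
                }
                written.add(output);
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LocalDirGunzipSketch <directory>");
            return;
        }
        for (Path p : gunzipAll(Paths.get(args[0]))) {
            System.out.println("wrote " + p);
        }
    }
}
```

In the Hadoop version, the `codec == null` test and `CompressionCodecFactory.removeSuffix` from the single-file example would fill the same two roles inside the loop.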