Flume loading CSV files from a spool directory into an HDFS sink

I have configured my Flume source as the spooldir type. I have many CSV files, plus .xl3 and .xls files, and I want my Flume agent to load all files from the spooldir into the HDFS sink. However, the Flume agent throws an exception.

Here is my Flume source configuration:

 agent.sources.s1.type = spooldir
 agent.sources.s1.spoolDir = /my-directory
 agent.sources.s1.basenameHeader = true
 agent.sources.s1.batchSize = 10000

And my HDFS sink:

 agent.sinks.sk1.type = hdfs
 agent.sinks.sk1.hdfs.path = hdfs://...:8020/user/importflume/%Y/%m/%d/%H
 agent.sinks.sk1.hdfs.filePrefix = %{basename}
 agent.sinks.sk1.hdfs.rollSize = 0
 agent.sinks.sk1.hdfs.rollCount = 0
 agent.sinks.sk1.hdfs.useLocalTimeStamp = true
 agent.sinks.sk1.hdfs.batchSize = 10000
 agent.sinks.sk1.hdfs.fileType = DataStream
 agent.sinks.sk1.serializer = avro_event
 agent.sinks.sk1.serializer.compressionCodec = snappy

You can use the configuration below for the spooling directory. Just supply the paths for your local file system and HDFS locations in the configuration below.

 #Flume Configuration Starts
 # Define a file channel called fileChannel on agent1
 agent1.channels.fileChannel1_1.type = file
 # on linux FS
 agent1.channels.fileChannel1_1.capacity = 200000
 agent1.channels.fileChannel1_1.transactionCapacity = 1000
 # Define a source for agent1
 agent1.sources.source1_1.type = spooldir
 # on linux FS
 #Spooldir in my case is /home/hadoop/Desktop/flume_sink
 agent1.sources.source1_1.spoolDir = 'path'
 agent1.sources.source1_1.fileHeader = false
 agent1.sources.source1_1.fileSuffix = .COMPLETED
 agent1.sinks.hdfs-sink1_1.type = hdfs
 #Sink is /flume_import under hdfs
 agent1.sinks.hdfs-sink1_1.hdfs.path = hdfs://'path'
 agent1.sinks.hdfs-sink1_1.hdfs.batchSize = 1000
 agent1.sinks.hdfs-sink1_1.hdfs.rollSize = 268435456
 agent1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
 agent1.sinks.hdfs-sink1_1.hdfs.rollCount = 50000000
 agent1.sinks.hdfs-sink1_1.hdfs.writeFormat = Text
 agent1.sinks.hdfs-sink1_1.hdfs.fileType = DataStream
 agent1.sources.source1_1.channels = fileChannel1_1
 agent1.sinks.hdfs-sink1_1.channel = fileChannel1_1
 agent1.sinks = hdfs-sink1_1
 agent1.sources = source1_1
 agent1.channels = fileChannel1_1
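Note that the exception may also come from spooling binary files such as .xls, since the spooldir source's default deserializer expects newline-delimited text. If only the CSV files need to be ingested, one option is to skip the other files with the spooldir source's ignorePattern property; the regex below is a sketch based on the extensions mentioned in the question:

 agent1.sources.source1_1.ignorePattern = ^.*\.(xl3|xls)$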

You can also refer to this blog post on the Flume spool dir for more information.