Hadoop in Practice: Counting Words
1. Running the simple word count program
First, prepare two text files by running the following commands:
echo "hello hadoop word count">/tmp/test_file1.txt
echo "hello hadoop,I'm a vegetable bird">/tmp/test_file2.txt
Copy the two files into HDFS by running:
bin/hadoop dfs -mkdir test-in     (create the directory test-in)
bin/hadoop dfs -copyFromLocal /tmp/test*.txt test-in    (copy both files into test-in)
bin/hadoop dfs -ls test-in           (verify the copy succeeded) — this prints the following listing:
Found 2 items
-rw-r--r-- 1 hadoop supergroup 24 2011-01-21 18:40 /user/hadoop/test-in/test_file1.txt
-rw-r--r-- 1 hadoop supergroup 34 2011-01-21 18:40 /user/hadoop/test-in/test_file2.txt
Note: test-in here is actually a directory on HDFS; its absolute path is "hdfs://localhost:9000/user/hadoop/test-in".
Run the example with the following command:
bin/hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount test-in test-out  (writes the results to test-out). The console prints:
11/01/21 18:50:16 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/01/21 18:50:17 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
11/01/21 18:50:17 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/01/21 18:50:17 INFO input.FileInputFormat: Total input paths to process : 2
11/01/21 18:50:17 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
11/01/21 18:50:17 INFO mapreduce.JobSubmitter: number of splits:2
11/01/21 18:50:18 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
11/01/21 18:50:18 INFO mapreduce.Job: Running job: job_201101211705_0001
11/01/21 18:50:19 INFO mapreduce.Job: map 0% reduce 0%
11/01/21 18:50:35 INFO mapreduce.Job: map 100% reduce 0%
11/01/21 18:50:44 INFO mapreduce.Job: map 100% reduce 100%
11/01/21 18:50:47 INFO mapreduce.Job: Job complete: job_201101211705_0001
11/01/21 18:50:47 INFO mapreduce.Job: Counters: 33
  FileInputFormatCounters
    BYTES_READ=58
  FileSystemCounters
    FILE_BYTES_READ=118
    FILE_BYTES_WRITTEN=306
    HDFS_BYTES_READ=300
    HDFS_BYTES_WRITTEN=68
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  Job Counters
    Data-local map tasks=2
    Total time spent by all maps waiting after reserving slots (ms)=0
    Total time spent by all reduces waiting after reserving slots (ms)=0
    SLOTS_MILLIS_MAPS=22290
    SLOTS_MILLIS_REDUCES=6539
    Launched map tasks=2
    Launched reduce tasks=1
  Map-Reduce Framework
    Combine input records=9
    Combine output records=9
    Failed Shuffles=0
    GC time elapsed (ms)=642
    Map input records=2
    Map output bytes=94
    Map output records=9
    Merged Map outputs=2
    Reduce input groups=8
    Reduce input records=9
    Reduce output records=8
    Reduce shuffle bytes=124
    Shuffled Maps =2
    Spilled Records=18
    SPLIT_RAW_BYTES=242
Check the job output:
bin/hadoop dfs -ls test-out   which shows:
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2011-01-21 18:50 /user/hadoop/test-out/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 68 2011-01-21 18:50 /user/hadoop/test-out/part-r-00000
View the final counts by running:
bin/hadoop dfs -cat  test-out/part-r-00000     which prints how many times each word appears in the input files:
a 1
bird 1
count 1
hadoop 1
hadoop,I'm 1
hello 2
vegetable 1
word 1
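The map → shuffle → reduce flow behind this output can be simulated conceptually with a short pure-Python sketch (this is an illustration of the counting logic, not the actual Hadoop Java implementation). It also shows why "hadoop,I'm" is counted as a single word: like the example WordCount job, it tokenizes on whitespace only, so punctuation stays attached to the token.

```python
from collections import defaultdict

# The contents of the two sample input files.
lines = [
    "hello hadoop word count",
    "hello hadoop,I'm a vegetable bird",
]

# Map phase: emit a (word, 1) pair for every whitespace-separated token.
# Splitting on whitespace only is why "hadoop,I'm" counts as one word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle + reduce phase: group pairs by key and sum the counts.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

# Print in sorted key order, mirroring part-r-00000.
for word in sorted(counts):
    print(word, counts[word])
```

Running this prints the same eight lines as the job's part-r-00000 output above, including hello 2 and hadoop,I'm 1.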