The (Hadoop) Distributed File System is based on Google's [[GFS paper|http://labs.google.com/papers/gfs.html]], which describes a distributed, failure-resistant file system.\n\nThe Hadoop Distributed File System is often abbreviated HDFS or even DFS. These terms are used interchangeably throughout this documentation.\n\nDFS is managed by a NameNode process. Files are split into blocks, which are stored on and served by [[DataNode]]s.
*Help! My MapReduce failed! What do I do?\n>Visit {{{http://}}}//{{{JobTracker_Machine_Name}}}//{{{:50030/jobtracker.jsp}}} and locate the MapReduce job which failed (under Failed Jobs). Clicking on the job name should bring you to a page with an overview and links to failure pages. From the failure pages you should be able to see which tasks failed and for what reason. Sometimes no reason will be given. Usually this just means the process failed before it could hand off the exception and notify the JobTracker. Don't worry; the next exception will likely be the same one as before, and one of them will be listed at some point. The exception should provide a stack trace. You can use this to see where your task failed.\n>Alternatively, you can search the log files. Look at the JobTracker page for this job to see which machines failed, and check the logs on those machines. There are a lot of log files, so you'll probably want to check the dates to see which were updated around the time of failure. If you started your tasks manually and didn't redirect your errors, they will be listed on {{{stdout}}} instead of in files in the logs directory. You should find a stack trace with the exception there, even if it didn't show up on the JobTracker page.\n\n*I've got a nondescript error! (i.e. "~JobFailed" was all the description you were given)\n>If you get an error and your MapReduce job is unable to continue, the process which invoked the job ([[bin/hadoop]] //{{{classname}}}//) will report something like:\n{{{\n06/09/31 10:42:56 INFO mapred.JobClient: map 100% reduce 100%\nException in thread "main" java.io.IOException: Job failed!\n at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:363)\n at com.google.hadoop.examples.MainClass.main(MainClass.java:355)\n}}}\n>This is just the JobTracker's "clean" exit. It means your job failed and continued to fail after multiple restarts. You should check the log files to see which specific error was the culprit. You should find a stack trace.\n\n*I keep getting the error below. What does it mean?\n{{{\njava.io.IOException: wrong value class: blah_blah_as_string is not class blah_blah\n at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)\n at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:147)\n at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)\n at org.apache.hadoop.mapred.CombiningCollector.flush(CombiningCollector.java:79)\n at org.apache.hadoop.mapred.CombiningCollector.collect(CombiningCollector.java:71)\n at com.google.hadoop.examples.Spellcheck$MapSpellcheck.map(Spellcheck.java:149)\n at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)\n at org.apache.hadoop.mapred.MapTask.run(MapTask.java:196)\n at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)\n}}}\n>This error isn't in your code, so how could you have caused it? This particular error comes from setting an incompatible class for {{{JobConf.setOutputValueClass(}}}//{{{class_name}}}//{{{)}}}. Similar errors can arise with any of the other {{{set*Class}}} methods. Check your code to make sure that your Map and Reduce classes are emitting the proper objects as specified by your set*Class functions. Remember, the classes for {key, value} must implement {WritableComparable, Writable}, respectively.\n\n*What is a {{{.()}}} function? 
I didn't define anything like this, so why is it trying to use a function I didn't write or call?\n{{{\njava.lang.RuntimeException: java.lang.NoSuchMethodException: com.google.hadoop.examples.Simple$MyMapper.()\n at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:45)\n at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:32)\n at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:53)\n at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)\n at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1210)\nCaused by: java.lang.NoSuchMethodException: com.google.hadoop.examples.Simple$MyMapper.()\n at java.lang.Class.getConstructor0(Class.java:2705)\n at java.lang.Class.getDeclaredConstructor(Class.java:1984)\n at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:41)\n ... 4 more\n}}}\n>This is actually the {{{<init>()}}} function, i.e. the default constructor. The web page doesn't escape its output into HTML, but dumps plain text, so {{{<init>}}} is treated as a (nonexistent) tag by your browser. Hadoop instantiates your classes through reflection, which requires a no-argument constructor it can call; a non-static nested {{{Mapper}}} or {{{Reducer}}} class does not have one. Try adding the {{{static}}} keyword to your class declaration, i.e.:\n{{{ public static class MyMapper extends MapReduceBase implements Mapper {...} }}}\n
Hadoop is highly configurable. Configuration files are kept in {{{conf}}} by default, but an alternate configuration directory can be specified on the command line.\n!Usage:\n{{{bin/hadoop --config }}}//{{{conf_dir}}}//{{{ }}}//{{{rest_of_args}}}//\n!Files:\n[[conf/hadoop-site.xml]]\n[[conf/hadoop-default.xml]]\n[[conf/slaves]]\n[[conf/mapred-default.xml]]\n[[conf/hadoop-env.sh]]\n[[conf/configuration.xsl]]\n[[conf/commons-loggin.properties]]\n
This is the [[(H)DFS]] daemon which reads and writes the data blocks stored on its local machine. It should be run on every machine that stores [[(H)DFS]] data, and it announces itself to the NameNode on startup.
Before you get started with [[Hadoop]] I recommend creating a standard filesystem layout across each machine. This makes it possible to use the [[bin/start-all.sh]] and [[bin/stop-all.sh]] scripts instead of manually starting and stopping everything. Mine is below:\n/\n>hadoop\n>>filesystem\n>>>data - Only on the machines which run [[DataNode]]s.\n>>>name - Only on the machine which runs the NameNode.\n>>>mapreduce\n>>home\n>>>in - Local copies of small testing input files\n>>>out - Local copies of the small output files from the testing files in {{{in}}}\n>>>workspace - Where I have my Eclipse projects and other source code\n>>>>//workspace_name//\n>>>>>bin -> ../../../mr/build/classes - So recently built files are used instead of anything in the jar files in lib; useful for changing debug messages\n>>>>>conf -> ../../../mr/conf - Just so I have the config files for my project show up in Eclipse.\n>>mr - This directory hosts all the files extracted from one of the Hadoop releases. Paths listed in this guide are relative to this directory.\n>>>bin\n>>>build\n>>>>classes - The classes here are used before those in the jar files in lib.\n>>>conf\n>>>>hadoop-site.xml\n>>>>slaves\n>>>>...\n>>>lib\n>>>plugins - Used only if you're building Nutch or using its library.\n>>>logs - Where log files are stored when processes are started via [[bin/start-all.sh]].\n\n----\nSee Also:\n[[Official Configuration Guide|http://wiki.apache.org/lucene-hadoop/HowToConfigure]]. I have not tried keeping the configuration in an external directory, though since my configuration is mirrored across machines, I should probably make that change.\n
[[Hadoop|http://lucene.apache.org/hadoop/]] is an open-source project led by the [[Apache|http://www.apache.org/]] Software Foundation to emulate Google's MapReduce. Hadoop is a sub-project of the [[Lucene|http://lucene.apache.org/]] search engine framework and a sister project of the [[Nutch|http://lucene.apache.org/nutch/]] search engine. Hadoop contains much of the distributed computing framework that was once part of Nutch. As an Apache project, it is released under [[version 2.0 of the Apache license|http://www.apache.org/licenses/LICENSE-2.0.txt]].\n----\nSee Also:\nMapReduce\n[[(H)DFS]]\n\n[[The Hadoop wiki|http://wiki.apache.org/lucene-hadoop/]]\n[[The Hadoop API|http://lucene.apache.org/hadoop/docs/api/]]\n[[The Hadoop home page|http://lucene.apache.org/hadoop/]]\n[[The Apache home page|http://www.apache.org/]]\n[[The Lucene home page|http://lucene.apache.org/]]
Notes on Hadoop
The classes which implement {{{Mapper}}} and {{{Reducer}}} should also extend {{{MapReduceBase}}}. {{{MapReduceBase}}} provides a {{{configure}}} method which may be overridden, and which provides access to the {{{JobConf}}}. The JobConf can store extra parameters for your Map and Reduce classes.\n{{{\nJobConf conf;\n// ...\nconf.set(arg_name, arg_string_value);\n}}}\n//{{{arg_name}}}// is usually something like {{{mapred.formals.}}}//{{{name}}}//, but it can be any arbitrary string.\nWhile //{{{arg_string_value}}}// can be any {{{Object}}}, only its {{{.toString()}}} value is actually stored, so make sure you can reconstruct what you want from a string value.\n\nYou can override the {{{configure}}} method to store any parameters you wish in local variables. For example:\n{{{\npublic static class MyMapper extends MapReduceBase implements Mapper {\n public static final String id_configValue = "mapred.formals.configvalue";\n private static final String default_configValue = "sample";\n private String configValue;\n\n public void configure(JobConf job) {\n this.configValue = job.get(id_configValue, default_configValue);\n }\n\n // ...\n}\n}}}
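For completeness, here is a minimal sketch of the driver side, which sets the value that {{{configure}}} reads above (the driver class name is just a placeholder):\n{{{\nJobConf conf = new JobConf(MyDriver.class); // MyDriver: placeholder for your main class\n// ...\n// the value is stored as a string under the key defined in MyMapper\nconf.set(MyMapper.id_configValue, "some value");\nJobClient.runJob(conf);\n}}}\n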
The new version 0.8 of Nutch uses [[Hadoop]] and MapReduce for doing its processing (where applicable).\n\nDisclaimer: Since I have only really used Nutch for fetching, you may want to check Nutch's [[homepage|http://lucene.apache.org/nutch/]] and its user [[mailing lists|http://lucene.apache.org/nutch/mailing_lists.html]] for specifics, optimizations, and other performance improvements. As far as I can tell, it's rather inefficient to run everything off the networked filesystem ([[(H)DFS]]). They seem to prefer breaking each segment part out to each machine's local file system, and letting it go from there. I haven't tried this, but for a tiny crawl of some sixty-thousand documents, search was instant, and that was good enough for me at the time.\n\nI recommend setting up a similar filesystem on all of the machines on which you wish to run your Nutch crawl and indexing. See [[GettingStarted]].\n\nNutch has a simple cycle for pulling down successive pages. Do not use the {{{crawl}}} parameter. It is meant for intranets and not really suited to web crawls.\n\n#{{{home$: vi urls/starting_urls.txt}}}\n**You'll need a starting list of URLs from which to pull your initial pages before you can branch out. These need to be in a plaintext file, one per line, in a directory on the [[(H)DFS]].\n#{{{home$: ../mr/bin/hadoop dfs -put urls /urls}}}\n**Copy the {{{urls}}} directory to [[(H)DFS]]\n#{{{home$: ../mr/bin/nutch inject crawl/crawldb /urls}}}\n**This will create a list of sites to grab from the plaintext files in {{{/urls}}} and write the list out to {{{crawl/crawldb}}}.\n#<start repeat>\n#{{{home$: ../mr/bin/nutch generate crawl/crawldb crawl/segments}}}\n#{{{home$: ../mr/bin/nutch fetch crawl/segments/}}}//{{{latest_segment}}}//{{{/}}}\n**This is what actually goes out and fetches the web pages, putting them into the //{{{latest_segment}}}// directory. WARNING, this can take a //long// time, depending on the current depth (how many previous times you have done this).\n#{{{home$: ../mr/bin/nutch updatedb crawl/crawldb crawl/segments/}}}//{{{latest_segment}}}//{{{/}}}\n**Re-inject the fetch-list with outgoing links from the recently-fetched pages in //{{{latest_segment}}}//\n#<end repeat>\n**The above steps can be repeated, as many times as desired, for various link depths. Using a starting list of 6 popular URLs (Slashdot, Wikipedia, the BBC, and some universities), the first crawl took me a few minutes, the next a few hours, and the fourth took a few days, producing a total of over 65,000 unique documents.\n#{{{home$: ../mr/bin/nutch invertlinks crawl/linkdb crawl/segments/*}}}\n**This will create an inverted link index ({{{crawl/linkdb}}}), for use in searching\n#{{{home$: ../mr/bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments*}}}\n**Actually build the index\n#{{{home$: ../mr/bin/nutch server 9003 crawl}}}\n**This starts a webserver on port 9003, using the crawled index (at {{{crawl}}}). This is what feeds results to the {{{.jsp}}} pages served by Tomcat.\n#{{{home$: echo "$HOST 9003 crawl" > crawl/search-servers.txt}}}\n**{{{search-servers.txt}}} is a listing of the servers hosting parts of the index which the {{{.jsp}}} search pages contact\n#{{{home$: rm -rf ../tomcat/webapps/ROOT}}}\n#{{{home$: cp ../mr/nutch*.war ../tomcat/webapps/ROOT.war}}}\n**Make sure the {{{.jsp}}} files being served are current and correct\n#{{{home$: ../tomcat/bin/catalina.sh run}}}\n**Starts the server at {{{http://$HOST:8080/}}}\n
Create a {{{.jar}}} with your class files (keeping the appropriate directory hierarchy). If you don't specify a main class, you'll have to do so on the command-line. This can be done easily in Eclipse from Export -> Java / ~JarFile, then selecting your project classes.\n
This assumes you know a little bit about the workings of Nutch, namely the generate, fetch, updatedb cycle.\n\nNutch fetches pages into what it calls "segments". Each segment is a collection of all the different files pulled down during a given fetch cycle. Each segment has several classes of information stored about it, each in different parts.\n\nWhile Nutch does use Hadoop and MapReduce to go through its tasks, it's a bit more work to get your Hadoop class working with Nutch's configurations and libraries. Nutch has many external dependencies, and you'll have to include several libraries for seemingly simple tasks.\n\n!Libraries to include:\nThe following files come from various Apache projects; they should be distributed in the {{{lib}}} directory within the Nutch release which was run to generate the segments. They are:\n{{{commons-lang-*.jar}}}\n{{{jakarta-oro-*.jar}}}\n{{{xerces-*-apis.jar}}}\n{{{xerces-*.jar}}}\nIf they are not already there, you'll have to add these files as external jars to your Eclipse project (or include them in the classpath), and place them in your distribution's {{{lib}}} directory on each machine which is going to be running your MapReduce. As with all updates to files which must be included in the CLASSPATH, you'll have to restart your JobTracker and [[TaskTracker]]s before they will be found.\n\n!Configuration:\nYou'll need to set a few configuration settings, or the job will crash with odd errors. It's easiest to just include the default Nutch configuration files: {{{/nutch/conf/nutch-default.xml}}} {{{/nutch/conf/parse-plugins.xml}}} {{{/nutch/conf/parse-plugins.dtd}}}. Place these into your {{{conf}}} folder, and when setting up your job, ensure you add this line:\n{{{\nJobConf conf;\n// ...\nconf.addDefaultResource("nutch-default.xml");\n}}}\nThis will make sure the settings required for Nutch are imported and that plugins like the parsers are available.\n\n!Input Format:\nNext, you'll have to make sure to use the right input class so the segments are read in properly. This is done with the {{{org.apache.hadoop.mapred.SequenceFileInputFormat}}} class.\n{{{\nJobConf conf;\n// ...\nconf.setInputFormat(SequenceFileInputFormat.class);\n}}}\n\nYou don't have to set the (deprecated) InputKeyClass or InputValueClass, as these are now read from the sequence file and loaded dynamically. However, since they are loaded dynamically, you will see a run-time, not a compile-time, warning should they not be available on the CLASSPATH.\n\n!Usage:\nFinally, you'll need to ensure your {{{map}}} function is ready to receive the proper keys and values. For the Nutch segments, this is {{{{UTF8}}}, {{{Content}}}}. NOTE: {{{UTF8}}} is deprecated. You can use the {{{key.toString()}}} method to get the originating url of the page in question, or you can get it from {{{(Content)value.getUrl()}}}, which is recommended.\n\nIf you want to parse the HTML from a page, use the {{{Parse}}} and {{{ParseUtil}}} classes. The example {{{map}}} function below merely extracts the data from a page and outputs it. <<slider chkSlider "src:com.google.hadoop.examples.SegmentReader.java" "(show file)" "Click to show file: com.google.hadoop.examples.SegmentReader.java">>\n\n!Invoking:\nMake sure you specify a specific segment's {{{content}}} directory, otherwise you'll get other metadata (I'm unfamiliar with what, exactly, is in these other files).\ne.g. {{{bin/hadoop com.google.hadoop.examples.SegmentReader crawl/segments/20060931185522/content /segments/out/plain/20060931185522}}}\n
You can specify different key and value classes for your {{{Map}}} output than you do for your {{{Reduce}}} output.\n\nReduce output classes are set by:\n{{{\nJobConf conf;\n// ...\nconf.setOutputKeyClass(Implements_WritableComparable.class);\nconf.setOutputValueClass(Implements_Writable.class);\n}}}\nUnless otherwise specified, these will also be the classes expected to be output by {{{Map}}}. You can specify the {{{Map}}} output classes explicitly by:\n{{{\nJobConf conf;\n// ...\nconf.setMapOutputKeyClass(Implements_WritableComparable.class);\nconf.setMapOutputValueClass(Implements_Writable.class);\n}}}\n\nBy default, the {{{OutputKeyClass}}} is {{{LongWritable}}}, and the {{{OutputValueClass}}} is {{{Text}}} (which is suitable for the default {{{InputFormat}}} of text files).\n\nNaturally, the output classes of {{{Map}}} will be the classes input to {{{Reduce}}}.\n\nAs with all {{{set*Class}}} methods of {{{JobConf}}}, the actual values stored are strings, so you can, in fact, specify the classes via their fully-qualified names; this is how you would do it if, for whatever reason, you specified these values in an external configuration file or on the command-line. But since most of these settings are specific to the {{{Map}}} and {{{Reduce}}} classes, it is usually more appropriate to specify them in code.\n
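For example, a job whose Map emits {{{(Text, IntWritable)}}} pairs while its Reduce emits {{{(Text, Text)}}} pairs could be configured as below. This is just a sketch; the classes are illustrative and not tied to any particular job.\n{{{\nJobConf conf;\n// ...\n// classes emitted by Map (and therefore received by Reduce):\nconf.setMapOutputKeyClass(Text.class);\nconf.setMapOutputValueClass(IntWritable.class);\n// classes emitted by Reduce (the final job output):\nconf.setOutputKeyClass(Text.class);\nconf.setOutputValueClass(Text.class);\n}}}\n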
To use a different file format as your input you'll need to write a class which extends org.apache.hadoop.mapred.~InputFormatBase. There is only one required method which must be overridden: {{{getRecordReader}}}. Examine [[InputFormatBase|http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormatBase.html]] more thoroughly for details on the other methods available to override.\n\nSource for an example implementation is here: [[src:com.google.hadoop.lib.mapred.GzipTextInputFormat.java]]. The example opens a gzip-compressed text (UTF-8) file, decompresses it, and passes the lines up to Map as {character offset in decompressed file, line}. <<slider chkSlider "src:com.google.hadoop.lib.mapred.GzipTextInputFormat.java" "(show file)" "Click to show file: com.google.hadoop.lib.mapred.GzipTextInputFormat.java">>\n\nIn your {{{main()}}} method, as you set up your JobConf, you'll need to specify:\n{{{\nJobConf conf;\n// ...\nconf.setInputFormat(GzipTextInputFormat.class);\n}}}\nAfter that, the file acts like a plain text input file (key: LongWritable, value: Text), and you can use it the same way in your Map and Reduce classes.\n
* [[How do I run my code?]]\n* [[How do I use different classes for my Map output than I use for my Reduce output?]]\n* [[How do I use gzip files as my input?]]\n* [[How do I pass parameters to my Map and Reduce classes?]]\n* How can my Map and Reduce functions use {{{JobConf}}} settings? see: [[How do I pass parameters to my Map and Reduce classes?]]\n* [[How do I perform a Nutch crawl?]]\n* [[How do I search the pages fetched by Nutch?]]\n
This is the main MapReduce daemon. It is typically only started on one machine in the cluster. It manages running MapReduce jobs and delegates them to the various TaskTracker worker nodes.\n!Usage:\n{{{ hadoop jobtracker}}}\n!Configuration:\n[[conf/hadoop-site.xml]]\n*name: {{{mapred.job.tracker}}}\n**value: //{{{machine_name}}}//{{{:}}}//{{{port_number}}}//\n**note: The configuration for the [[TaskTracker]]s which connect to this JobTracker must use the same {{{machine_name}}} and {{{port_number}}} as listed here.\n**note: The use of {{{local}}} instead of //{{{machine_name}}}//{{{:}}}//{{{port_number}}}// seems deprecated. This has not been confirmed. {{{local}}} would indicate that jobs are to be run in-process as a single map and reduce task.\n*name: {{{mapred.map.tasks}}}\n**value: //{{{4}}}//\n**note: Usually set to either a prime number several times greater than the number of available [[TaskTracker]]s, or just twice the number of [[TaskTracker]]s (documentation differs).\n**note: Ignored when {{{mapred.job.tracker}}} is {{{local}}}.\n*name: {{{mapred.reduce.tasks}}}\n**value: //{{{4}}}//\n**note: Usually set to a prime number around the number of available [[TaskTracker]]s.\n**note: This is also the number of output files from the reduction stage. If fewer files are desired, but there is a lot of work to do in the reduction stage, post-processing, possibly as another MapReduce with fewer reduce tasks, must be done.\n**note: Ignored when {{{mapred.job.tracker}}} is {{{local}}}.\n*name: {{{mapred.local.dir}}}\n**value: //{{{/tmp/hadoop/mapred/local}}}//\n**note: The local directory where intermediate data files are stored. May be a comma-separated list of directories on different devices in order to spread disk I/O.\n**note: Non-existent directories are ignored.\n<<slider chkSlider __MapReduceConfiguration_More "more" "Less common configuration options.">>\n----\nSee Also:\n[[bin/hadoop]]\nTaskTracker\n
GettingStarted\n[[Terms]] <<slider chkSlider Terms "(expand)" "Click to show individual terms">>\n[[Configuration]] <<slider chkSlider __Configuration_Files "(expand)" "Click to show individual configuration files">>\n[[Setup]] <<slider chkSlider __Setup_Steps "(expand)" "Click to show setup steps">>\n[[Working with Eclipse]]\n[[Writing a MapReduce]]\n\n[[Official Links]]\n[[Common Errors]]\n[[Status Pages]]\n[[How do I...]]
MapReduce was created by Google to simplify parallel computation over independent records in a large corpus of data. A paper was [[published|http://labs.google.com/papers/mapreduce.html]] in 2004 describing the process.\n\nHadoop is an open-source project led by the [[Apache|http://www.apache.org/]] Software Foundation to emulate Google's MapReduce.\n----\nSee Also:\n[[(H)DFS]]\n[[Hadoop]]\n
This is the main [[(H)DFS]] daemon. It administers all the [[DataNode]]s. [[DataNode]]s announce themselves to the NameNode when they start up.\n\nThe NameNode contains an index of which [[DataNode]]s are active, which blocks they have, and which files map to which blocks. When a file is requested, the NameNode can direct the request to the appropriate DataNode.\n\n!Important!\nDo not kill the NameNode process before you kill the DataNode processes. This can put the [[(H)DFS]] into an invalid state. The only known way to recover is to reformat.\n\n!Usage:\n{{{ hadoop namenode}}}\n{{{ hadoop namenode -format}}}\n!Configuration:\n[[conf/hadoop-site.xml]]\n*name: {{{fs.default.name}}}\n**value: //{{{machine_name}}}//{{{:}}}//{{{port_number}}}//\n**note: The configuration for the [[DataNode]]s which connect to this NameNode must use the same //{{{machine_name}}}// and //{{{port_number}}}// as listed here.\n*name: {{{dfs.name.dir}}}\n**value: //{{{/hadoop/filesystem/name}}}//\n*name: {{{dfs.data.dir}}}\n**value: //{{{/hadoop/filesystem/data}}}//\n*name: {{{dfs.replication}}}\n**value: //{{{3}}}//\n!Log:\n[[logs/hadoop-$USER-namenode-$HOSTNAME.log]]\n!Status Page:\nhttp://$HOSTNAME:50070/dfshealth.jsp\n----\nSee Also:\n[[bin/hadoop]]\nDataNode\n
[[Hadoop's home page|http://lucene.apache.org/hadoop/]]\n[[Hadoop's license|http://www.apache.org/licenses/LICENSE-2.0.html]]\n[[Hadoop's Wiki|http://wiki.apache.org/lucene-hadoop/HadoopOverview]]; mainly development updates.\n[[Hadoop's API/Code documentation|http://lucene.apache.org/hadoop/docs/api/]]; these are just the javadocs, but useful if you don't feel like reading the actual source.\n\n[[Apache|http://www.apache.org/]], Hadoop's project sponsor\n[[Lucene, Hadoop's parent project|http://lucene.apache.org/]], which is less related to Hadoop and more closely tied to the free search engine [[Nutch|http://lucene.apache.org/nutch/]].\n
There are several ways to start Hadoop, and several different parts to the process.\n\n!Simple, batch:\nThe simplest way is to run [[bin/start-all.sh]]. This will start the [[TaskTracker]]s and [[DataNode]]s on all the machines listed in the [[conf/slaves]] file. It will also start the NameNode and JobTracker on the current machine. All output is piped to the {{{.log}}} and {{{.out}}} files in the {{{logs/}}} directory, named according to the format\n{{{logs/hadoop-$USER-}}}//{{{task}}}//{{{-$HOST.(out|log)}}}\n\n!Individually:\nAlternatively, you can start each needed process individually and watch the logs in the terminal.\n!!NameNode\nStart the NameNode on one machine with\n{{{bin/hadoop namenode}}}\n!!JobTracker\nStart the JobTracker on one machine with\n{{{bin/hadoop jobtracker}}}\n!!DataNode\nStart the DataNode on each machine on which you wish to store data with\n{{{bin/hadoop datanode}}}\n!!TaskTracker\nStart the TaskTracker on each machine on which you wish to run MapReduce tasks with\n{{{bin/hadoop tasktracker}}}\n\n!!MapReduce\nMapReduce jobs can now be run using\n{{{bin/hadoop }}}//{{{main_class}}}//{{{ [}}}//{{{args}}}//{{{]}}}\nIf you're using a jar file (which you probably will be), use\n{{{bin/hadoop jar }}}//{{{my_jar.jar}}}//{{{ [}}}//{{{main_class}}}//{{{] [}}}//{{{args}}}//{{{]}}}\nMapReduce jobs can be examined with\n{{{bin/hadoop job -status }}}//{{{job_id}}}//\nor at the url\n{{{http://}}}//{{{job_tracker_machine_name}}}//{{{:50030/jobdetails.jsp?jobid=}}}//{{{job_id}}}//\nor killed with\n{{{bin/hadoop job -kill }}}//{{{job_id}}}//\nwhere //{{{job_id}}}// is usually of the form {{{job_####}}}\n
A personal wiki on all things Hadoop
Hadoop ~MiniWiki
Note that these port numbers are configurable, in case you want to run multiple [[JobTracker]]s, [[TaskTracker]]s, [[NameNode]]s, or [[DataNode]]s per machine. The port numbers given here are the defaults specified in [[conf/hadoop-default.xml]].\nhttp://JobTracker_Machine_Name:50030/jobtracker.jsp\n>This page will give you the status of all your jobs as well as snippets of error logs for debugging purposes. Turn here if you get something like the nondescript "~JobFailed" from your invocation command-line.\nhttp://TaskTracker_Machine_Name:50060/tasktracker.jsp\n>This page will give details on the tasks the given node is working on, as well as access to the logs directory. Note that this logs directory is simply the directory {{{logs}}}. If you have not redirected output to files in this directory (as is done by the [[bin/start-all.sh]] script), then you will have just logs describing the history of submitted jobs in the {{{logs/history}}} directory.\nhttp://NameNode_Machine_Name:50070/dfshealth.jsp\n>This page will give you the status of the nodes in the [[(H)DFS]] cluster. You can see how much space the entire file system is using, as well as how much each machine in the cluster is using.\n
This is the client MapReduce daemon. It must be started on each machine which is to be a MapReduce worker. It can be on the same machine as the JobTracker.\n\n[[TaskTracker]]s will contact the JobTracker on startup, notifying it that they are ready to work.\n\nWhile multiple [[TaskTracker]]s may be run on one machine, it is currently unknown whether there is a benefit to this besides making one machine run more tasks than others. It is also unknown whether having multiple trackers on one machine creates conflicts.\n!Usage:\n{{{ hadoop tasktracker}}}\n!Configuration:\n[[conf/hadoop-site.xml]]\n*name: {{{mapred.job.tracker}}}\n**value: //{{{machine_name}}}//{{{:}}}//{{{port_number}}}//\n**note: The configuration for the JobTracker to which this TaskTracker connects must use the same {{{machine_name}}} and {{{port_number}}} as listed here.\n*name: {{{mapred.map.tasks}}}\n**value: //{{{4}}}//\n**note: Usually set to either a prime number several times greater than the number of available [[TaskTracker]]s, or just twice the number of [[TaskTracker]]s (documentation differs).\n**note: Ignored when {{{mapred.job.tracker}}} is {{{local}}}.\n**note: This may use the value from the JobTracker, or it may use the value from the local config. Operation specifics unknown.\n*name: {{{mapred.reduce.tasks}}}\n**value: //{{{4}}}//\n**note: Usually set to a prime number around the number of available [[TaskTracker]]s.\n**note: This may use the value from the JobTracker, or it may use the value from the local config. Operation specifics unknown.\n**note: This is also the number of output files from the reduction stage. If fewer files are desired, but there is a lot of work to do in the reduction stage, post-processing, possibly as another MapReduce with fewer reduce tasks, must be done.\n*name: {{{mapred.local.dir}}}\n**value: //{{{/tmp/hadoop/mapred/local}}}//\n**note: The local directory where intermediate data files are stored. May be a comma-separated list of directories on different devices in order to spread disk I/O.\n**note: Non-existent directories are ignored.\n<<slider chkSlider __MapReduceConfiguration_More "more" "Less common configuration options.">>\n----\nSee Also:\n[[bin/hadoop]]\nJobTracker\n
*[[(H)DFS]]\n*[[bin/hadoop]]\n*[[conf/hadoop-default.xml]]\n*[[conf/hadoop-site.xml]]\n*DataNode\n*[[Hadoop]]\n*JobTracker\n*MapReduce\n*NameNode\n*TaskTracker
You'll have to make sure that all the {{{.jar}}} files in [[lib/jar]] are included in your Java Build Path.\n>Under Libraries, click "Add External ~JARs..." and select all the {{{.jar}}} files in [[lib/jar]] and its subfolders.\nIf you want to browse the source files (which can be very handy in debugging), you'll want to add them.\n>Right click on your project folder in the package view and select {{{Import -> General / File System}}}. Selecting the proper directory is important to get packages to show up properly. From directory should be {{{(/hadoop/mr/)src}}}. Expand the folder {{{java}}}. Check the {{{org}}} folder. The folder will show up under the project folder. Right-click on it and select {{{Build Path -> Use as Source Folder}}}.\n>My workspace is laid out as follows:\n>workspace\n>>Hadoop\n>>>{{{bin -> /hadoop/mr/build/classes }}}- {{{build/classes}}} is added to the CLASSPATH environment variable by [[bin/hadoop]]. This directory contains the class files for the Hadoop source code. If you're going to change the source code, you'll want to build your classes to here, where they'll be used before any jar libraries. If not, just have the {{{bin}}} folder be a normal directory and not a symlink, and the class files will come from the release jar.\n>>>{{{conf -> /hadoop/mr/conf/ }}}- Just a convenience to have the configuration files available for my project.\n>>>{{{libs }}}- A copy of /hadoop/mr/src/java (so I can fix bugs or put more debugging info into the source code).\n>>>{{{src }}}- My source code.\n>>>{{{exec }}}- Where I put my executable {{{.jar}}} files for MapReduce jobs.
MapReduce comes from two standard features of many functional programming languages.\n\nThe Map function takes a {key, value} pair, performs some computation on it, then outputs a {key, value} pair.\n{{{\nMap: (key, value) -> (key, value)\n}}}\n\nThe Reduce function takes a {key, value[]} pair, performs some computation on it, then outputs an array of {key, value} pairs.\n{{{\nReduce: (key, value[]) -> (key, value)[]\n}}}\n\n{{{output.collect(}}}//{{{key}}}//{{{, }}}//{{{value}}}//{{{);}}} is how Map and Reduce functions emit their {key, value} pairs.\n\nWriting a MapReduce is as simple as writing these two functions (or using already-provided functions), then telling a job object which functions you want to use. <<slider chkSlider __MapReduceTrivial "Trivial Identity MapReduce." "Click to show the trivial map reduce.">>\n\nThis may seem like a lot of setup code, but it's pretty simple and straightforward.\n\n!~JobConf\nThere are many options for ~JobConf objects. For complete details see [[JobConf|http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/JobConf.html]] on the [[Hadoop API|http://lucene.apache.org/hadoop/docs/api/]] page. I shall cover what I have found to be the most useful options here.\n\n*{{{JobConf(Class }}}//{{{main_class}}}//{{{);}}}\n**This initializes the configuration. Use the class for your main file to assist with logging.\n**For those of you not familiar with reflection, you can use //{{{ClassName}}}//{{{.class}}} to specify a class statically (which is mostly what you will be doing).\n*{{{conf.setCombinerClass(Class }}}//{{{combiner}}}//{{{);}}}\n**//{{{combiner}}}//{{{ implements Reducer}}}\n**The combiner class is like a Reducer; in fact, it is often the same class as the Reducer. It operates after all the Maps have been done on a single node, but before gathering up the keys from the other nodes. This is an optimization to make use of all the data already in memory on one machine. It is most useful in examples like WordCount, in which the Map simply outputs {{{(}}}//{{{key}}}//{{{,1)}}} pairs. The combiner can easily reduce the amount of data sent to the Reduce phase by pre-summing the pairs.\n*{{{conf.setOutputKeyClass(Class }}}//{{{obj}}}//{{{);}}}\n**//{{{obj}}}//{{{ implements WritableComparable}}}\n**This is the type of the keys output by the {{{Reduce}}} function. Unless {{{setMapOutputKeyClass}}} is used, it is also the type of the keys emitted by {{{Map}}}.\n*{{{conf.setOutputValueClass(Class }}}//{{{obj}}}//{{{);}}}\n**//{{{obj}}}//{{{ implements Writable}}}\n**This is the type of the values output by the {{{Reduce}}} function. Unless {{{setMapOutputValueClass}}} is used, it is also the type of the values emitted by {{{Map}}}.\n*{{{conf.setInputFormat(Class }}}//{{{format}}}//{{{);}}}\n**//{{{format}}}//{{{ extends InputFormatBase}}}\n**This class provides methods for reading in custom input formats. 
This is done using a (usually inlined) subclass of {{{RecordReader}}}.\n**See [[InputFormatBase|http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormatBase.html]].\n**You may optionally specify how files are to be split if they are too big.\n**You can also specify which files within a directory are going to be used as input, instead of all of them.\n*{{{conf.setNumReduceTasks(}}}//{{{1}}}//{{{);}}}\n**This controls the number of reduce tasks, and consequently the number of output files.\n**You usually want to specify this in the [[conf/hadoop-site.xml]] config file, but sometimes it only makes sense to have a single output file, as when sorting. In other cases setting this should be avoided, so as to make best use of cluster resources.\n*{{{conf.set(String }}}//{{{property_name}}}//{{{, String }}}//{{{value}}}//{{{);}}}\n*{{{conf.get(String }}}//{{{property_name}}}//{{{, String }}}//{{{default_value}}}//{{{);}}}\n**In theory, //{{{value}}}// can be an {{{Object}}}, but it is just {{{.toString()}}}'d.\n**Use these methods to pass other parameters to your Map and Reduce classes.\n\n!Mapper & Reducer\nWhile {{{map}}} is the only required function to override in a Mapper class, and {{{reduce}}} is the only required function to override in a Reducer class, there are a few other useful ones which are inherited from ~MapReduceBase. For complete details see [[MapReduceBase|http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/MapReduceBase.html]] on the Hadoop API page. I shall cover what I have found to be the most useful options here.\n\n*{{{configure(JobConf job)}}}\n**Use this method to initialize your Mapper or Reducer object. This is where you have access to the JobConf object for the current job. You must extract any parameters here. It is called before {{{map}}} and {{{reduce}}}.\n*{{{close()}}}\n**This method is called after the last call to {{{map}}} or {{{reduce}}}. Use it to clean up anything you initialized in {{{configure()}}}. A short sketch using both methods follows below.\n
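As a sketch, a Mapper that reads a parameter in {{{configure}}} and cleans up in {{{close}}} might look like the following (the parameter name and the prefix behavior are hypothetical; imports are as in the other examples in this wiki):\n{{{\npublic static class PrefixMapper extends MapReduceBase implements Mapper {\n private String prefix; // hypothetical parameter pulled from the JobConf\n\n public void configure(JobConf job) {\n // called once, before any calls to map()\n prefix = job.get("mapred.formals.prefix", "");\n }\n\n public void map(WritableComparable key, Writable value,\n OutputCollector output, Reporter reporter) throws IOException {\n // prepend the configured prefix to each value\n output.collect(key, new Text(prefix + value.toString()));\n }\n\n public void close() throws IOException {\n // called once, after the last call to map(); release anything opened in configure()\n prefix = null;\n }\n}\n}}}\n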
*[[conf/hadoop-site.xml]]\n*[[conf/hadoop-default.xml]]\n*[[conf/slaves]]\n*[[conf/mapred-default.xml]]\n*[[conf/hadoop-env.sh]]\n*[[conf/configuration.xsl]]\n*[[conf/commons-loggin.properties]]\n
*name: {{{mapred.system.dir}}}\n**value: //{{{/tmp/hadoop/mapred/system}}}//\n**note: The ''shared'' directory where files are stored. This will be put into [[(H)DFS]] if present.\n*name: {{{mapred.temp.dir}}}\n**value: //{{{/tmp/hadoop/mapred/temp}}}//\n**note: The ''shared'' directory for temporary files. This will be put into [[(H)DFS]] if present.\n*name: {{{mapred.local.dir.minspacestart}}}\n**value: //{{{0}}}//\n**note: If the space in {{{mapred.local.dir}}} drops under this, do not ask for more tasks.\n**note: Value in bytes.\n*name: {{{mapred.local.dir.minspacekill}}}\n**value: //{{{0}}}//\n**note: If the space in {{{mapred.local.dir}}} drops under this, do not ask for more tasks until the current ones have finished and cleaned up. To save the rest of the running tasks, one will be killed to free up some space. The first one killed will be a reduce task, followed by those which have progressed the least.\n**note: Value in bytes.\n
{{{\npackage com.google.hadoop.examples;\n\nimport java.io.IOException;\nimport java.util.Iterator;\n\nimport org.apache.hadoop.fs.Path;\nimport org.apache.hadoop.io.Writable;\nimport org.apache.hadoop.io.WritableComparable;\nimport org.apache.hadoop.mapred.JobClient;\nimport org.apache.hadoop.mapred.JobConf;\nimport org.apache.hadoop.mapred.MapReduceBase;\nimport org.apache.hadoop.mapred.Mapper;\nimport org.apache.hadoop.mapred.OutputCollector;\nimport org.apache.hadoop.mapred.Reducer;\nimport org.apache.hadoop.mapred.Reporter;\n\npublic class Trivial {\n \n public static void main(String[] args) throws IOException {\n JobConf conf = new JobConf(Trivial.class);\n conf.setJobName("trivial");\n \n conf.setMapperClass(Trivial.IdentityMapper.class);\n conf.setReducerClass(Trivial.IdentityReducer.class);\n\n if (args.length < 2) {\n System.out.println("ERROR: Wrong number of parameters");\n System.out.println("trivial <input_path> <output_path>");\n System.exit(1);\n }\n \n conf.setInputPath(new Path(args[0]));\n conf.setOutputPath(new Path(args[1]));\n\n JobClient.runJob(conf);\n }\n\n // Nested Mapper and Reducer classes must be static so Hadoop can\n // instantiate them by reflection (see Common Errors).\n public static class IdentityMapper extends MapReduceBase implements Mapper {\n public void map(WritableComparable key, Writable val,\n OutputCollector output, Reporter reporter)\n throws IOException {\n output.collect(key, val);\n }\n }\n\n public static class IdentityReducer extends MapReduceBase implements Reducer {\n public void reduce(WritableComparable key, Iterator values,\n OutputCollector output, Reporter reporter)\n throws IOException {\n while (values.hasNext()) {\n output.collect(key, (Writable)values.next());\n }\n }\n }\n}\n}}}
#Create a {{{.jar}}} with your class files (keeping the appropriate directory hierarchy). If you don't specify a main class, you'll have to do so on the command-line. This can be done easily in Eclipse from Export -> Java / ~JarFile, then selecting your project classes.\n
#//{{{host_0}}}//{{{$ bin/hadoop namenode}}}\n#//{{{host_0}}}//{{{$ bin/hadoop jobtracker}}}\n#//{{{host_0}}}//{{{$ bin/hadoop datanode}}}\n#//{{{host_1}}}//{{{$ bin/hadoop datanode}}}\n#...\n#//{{{host_n}}}//{{{$ bin/hadoop datanode}}}\n#//{{{host_0}}}//{{{$ bin/hadoop tasktracker}}}\n#//{{{host_1}}}//{{{$ bin/hadoop tasktracker}}}\n#...\n#//{{{host_n}}}//{{{$ bin/hadoop tasktracker}}}\n
This is the shell script which sets up everything for [[Hadoop]]. Use it to run the NameNode, JobTracker, DataNode, and TaskTracker. Use it to monitor and manipulate running jobs.\n\nCommands started with bin/hadoop cannot be terminated by killing this script. This script spawns Java processes which must be killed separately.\n\n!Usage:\n{{{ hadoop [--config confdir] }}}//{{{command}}}//\n\n!!Where //command// is one of\n{{{ namenode -format }}}format the NameNode\n{{{ namenode }}}run a NameNode\n{{{ datanode }}}run a DataNode\n{{{ dfs }}}run a [[(H)DFS]] admin client\n{{{ fsck }}}run a [[(H)DFS]] filesystem checking utility\n{{{ jobtracker }}}run a MapReduce JobTracker node\n{{{ tasktracker }}}run a MapReduce TaskTracker node\n{{{ job }}}manipulate MapReduce jobs\n{{{ jar }}}//{{{jar}}}//{{{ }}}run a jar file\n{{{ distcp }}}//{{{srcurl}}}//{{{ }}}//{{{desturl}}}//{{{ }}}copy files or directories recursively\nor\n{{{ CLASSNAME }}}run the class named CLASSNAME\n
This is the default configuration file for [[Hadoop]]. It is loaded before [[conf/hadoop-site.xml]], and should never be changed. All changes should be made in [[conf/hadoop-site.xml]] to override the defaults.\n\nThe structure is the same as [[conf/hadoop-site.xml]]. It is an XML document with a top-level {{{configuration}}} element containing many {{{property}}} elements. Each {{{property}}} contains a {{{name}}}, a {{{value}}}, and an optional {{{description}}}, in that order. ~HTML-style comments may also be included ({{{<!-- this is a comment -->}}}).\n\nThis file is viewed with the [[conf/configuration.xsl]] stylesheet.\n----\nSee Also:\n[[conf/hadoop-site.xml]]\n
This script sets up environment variables for the jobs executed by the [[bin/hadoop]] script.\n\nThe only required variable is ~JAVA_HOME, which must be a directory containing the {{{java}}} executable in the subdirectory {{{bin}}}.\n!Usage:\n$~JAVA_HOME/bin/java\n----\nSee Also:\n[[bin/hadoop]]\n
This configuration file is processed after [[conf/hadoop-default.xml]]. Any {{{property}}} elements here override the {{{property}}} element in [[conf/hadoop-default.xml]] with the same {{{name}}} element. Any setting of {{{property}}} elements should be done in this file instead of [[conf/hadoop-default.xml]].\n\nThe structure is the same as [[conf/hadoop-default.xml]]. It is an XML document with a top-level {{{configuration}}} element containing many {{{property}}} elements. Each {{{property}}} contains a {{{name}}}, a {{{value}}}, and an optional {{{description}}}, in that order. ~HTML-style comments may also be included ({{{<!-- this is a comment -->}}}).\n\nIf using multiple [[JobTracker]]s on one machine, make sure that you override the listening ports (like the webserver or tasktracker listener), as two processes can't listen on the same port.\n\nThis file is viewed with the [[conf/configuration.xsl]] stylesheet.\n----\nSee Also:\n[[conf/hadoop-default.xml]]\n
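For example, a single overridden property looks something like the following sketch (the value is only a placeholder; see JobTracker for what this particular property means):\n{{{\n<?xml version="1.0"?>\n<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>\n\n<configuration>\n <property>\n <name>mapred.job.tracker</name>\n <value>machine_name:port_number</value>\n <description>Host and port that the JobTracker listens on.</description>\n </property>\n</configuration>\n}}}\n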
This configuration file is processed after [[conf/hadoop-site.xml]]. Any {{{property}}} elements here override the {{{property}}} element in [[conf/hadoop-site.xml]] with the same {{{name}}} element. Any setting of {{{property}}} elements which are specific to MapReduce jobs should be put here.\n\nThe structure is the same as [[conf/hadoop-site.xml]] and [[conf/hadoop-default.xml]]. It is an XML document with a top-level {{{configuration}}} element containing many {{{property}}} elements. Each {{{property}}} contains a {{{name}}}, a {{{value}}}, and an optional {{{description}}}, in that order. ~HTML-style comments may also be included ({{{<!-- this is a comment -->}}}).\n\nThis file is viewed with the [[conf/configuration.xsl]] stylesheet.\n----\nSee Also:\n[[conf/hadoop-default.xml]]\n
This configuration file is a simple newline-delimited list of slave node hostnames. The nodes listed here will have their TaskTracker and DataNode processes started by [[bin/start-all.sh]].\n
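For example, a three-node cluster's slaves file might look like this (the hostnames are hypothetical):\n{{{\nslave1.example.com\nslave2.example.com\nslave3.example.com\n}}}\n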
{{{\n/*\n * Copyright 2006 Google Inc. All Rights Reserved.\n * \n * Author: Alden King\n */\n\npackage com.google.hadoop.examples;\n\nimport java.io.IOException;\nimport java.util.Iterator;\n\nimport org.apache.hadoop.fs.Path;\nimport org.apache.hadoop.io.Text;\nimport org.apache.hadoop.io.Writable;\nimport org.apache.hadoop.io.WritableComparable;\nimport org.apache.hadoop.mapred.JobClient;\nimport org.apache.hadoop.mapred.JobConf;\nimport org.apache.hadoop.mapred.MapReduceBase;\nimport org.apache.hadoop.mapred.Mapper;\nimport org.apache.hadoop.mapred.OutputCollector;\nimport org.apache.hadoop.mapred.Reducer;\nimport org.apache.hadoop.mapred.Reporter;\nimport org.apache.hadoop.mapred.SequenceFileInputFormat;\nimport org.apache.nutch.parse.Parse;\nimport org.apache.nutch.parse.ParseException;\nimport org.apache.nutch.parse.ParseStatus;\nimport org.apache.nutch.parse.ParseUtil;\nimport org.apache.nutch.protocol.Content;\n\npublic class SegmentReader {\n \n public static void main(String[] args) throws IOException {\n JobConf conf = new JobConf(SegmentReader.class);\n conf.setJobName("segment reader");\n \n conf.addDefaultResource("nutch-default.xml");\n\n conf.setInputFormat(SequenceFileInputFormat.class);\n \n //conf.setOutputFormat(ParseOutputFormat.class); // for serializing objects\n conf.setOutputKeyClass(Text.class);\n conf.setMapOutputValueClass(Text.class);\n conf.setOutputValueClass(Content.class);\n \n conf.setMapperClass(ExtractSegmentMapper.class);\n conf.setReducerClass(ExtractSegmentReducer.class);\n if (args.length < 2) {\n System.out.println("ERROR: Wrong number of parameters");\n System.out.println("simple <input_path> <output_path>");\n System.exit(1);\n }\n \n conf.setInputPath(new Path(args[0]));\n conf.setOutputPath(new Path(args[1]));\n\n JobClient.runJob(conf);\n }\n\n public static class ExtractSegmentMapper extends MapReduceBase implements Mapper {\n ParseUtil util;\n public void configure(JobConf conf) {\n util = new ParseUtil(conf);\n }\n public void map(WritableComparable key, Writable value,\n OutputCollector output, Reporter reporter)\n throws IOException {\n Content content = (Content)value;\n Parse parse = null;\n ParseStatus status = null;\n try {\n parse = util.parse(content);\n } catch (ParseException e) {\n status = new ParseStatus(e);\n }\n if (status == null || status.isSuccess()) {\n // Here we output the full page, with markup.\n output.collect(new Text(content.getUrl()), new Text(new String(content.getContent())));\n\n // Alternatively, you can also output the content as a whole object;\n // this is best when used with conf.setOutputFormat(ParseOutputFormat.class); so as to be properly serialized\n //output.collect(new Text(content.getUrl()), content);\n \n // You can also output just the page's text, sans markup.\n //output.collect(new Text(content.getUrl()), new Text(parse.getText().toString()));\n } else if (status != null) {\n output.collect(new Text(content.getUrl()), new Text(status.getMessage()));\n } else {\n output.collect(new Text(content.getUrl()), new Text("Error parsing message."));\n }\n }\n }\n\n public static class ExtractSegmentReducer extends MapReduceBase implements Reducer {\n public void reduce(WritableComparable key, Iterator values,\n OutputCollector output, Reporter reporter)\n throws IOException {\n output.collect(new Text(key.toString()), (Writable)values.next());//new Text(values.next().toString())); // collect first value\n }\n }\n}\n\n}}}\n
{{{\n/*\n * Copyright 2006 Google Inc. All Rights Reserved.\n * \n * Author: Alden King\n */\n\npackage com.google.hadoop.mapred;\n\nimport java.io.BufferedReader;\nimport java.io.IOException;\nimport java.io.InputStreamReader;\nimport java.nio.charset.Charset;\nimport java.util.ArrayList;\nimport java.util.zip.GZIPInputStream;\n\nimport org.apache.commons.logging.Log;\nimport org.apache.commons.logging.LogFactory;\nimport org.apache.hadoop.mapred.FileSplit;\nimport org.apache.hadoop.mapred.InputFormatBase;\nimport org.apache.hadoop.mapred.JobConf;\nimport org.apache.hadoop.mapred.RecordReader;\nimport org.apache.hadoop.mapred.Reporter;\nimport org.apache.hadoop.fs.FileSystem;\nimport org.apache.hadoop.fs.FSDataInputStream;\nimport org.apache.hadoop.fs.Path;\n\nimport org.apache.hadoop.io.*;\n\nimport com.google.hadoop.lib.PlainText;\n\n/** An {@link InputFormat} for gzip'd plain text files. Files are broken into lines.\n * Either linefeed or carriage-return are used to signal end of line. Keys are\n * the position in the unzip'd file, and values are the line of text.\n * Format is (LongWritable, Text)\n */\npublic class GzipTextInputFormat extends InputFormatBase {\n public static final Log LOG = LogFactory.getLog("com.google.hadoop.mapred.GzipTextInputFormat");\n\n public RecordReader getRecordReader(FileSystem fs, FileSplit split,\n JobConf job, Reporter reporter)\n throws IOException {\n\n reporter.setStatus(split.toString());\n\n final long start = split.getStart();\n final long end = start + split.getLength();\n\n final FSDataInputStream is = fs.open(split.getPath());\n\n final InputStreamReader isr = new InputStreamReader(\n new GZIPInputStream(is), Charset.forName("UTF-8"));\n // open the file and seek to the start of the split\n final BufferedReader in = new BufferedReader(isr);\n\n /* here we take advantage of each char being a byte\n * if, at some future date, this class supports more complex encodings\n * then some greater care will have to be taken with this seek\n */\n /* this is the decompressed position.\n * for compressed position, use is.getPos()\n */\n long pos = 0; \n if (start >= 0) {\n /* since it's compressed, we have to start from the beginning\n * and go until we get to our proper starting point\n */ \n in.mark(1);\n int intc = in.read(); // pos++\n while (intc >= 0 && is.getPos() <= start) {\n in.mark(1); /* set the position, so we can put the character back */\n intc = in.read();\n ++pos;\n }\n in.reset(); // pos--\n char c;\n while (intc >= 0 && is.getPos() < end) { // scan to next newline before end\n c = (char)intc;\n if (c == '\sr' || c == '\sn')\n break;\n intc = in.read();\n ++pos;\n }\n }\n \n final long initialPosition = pos;\n\n return new RecordReader() {\n \n private long deflate_position = initialPosition;\n \n public WritableComparable createKey() {\n return new LongWritable();\n }\n \n public Writable createValue() {\n return new Text();\n }\n \n /** Read a line. 
*/\n public synchronized boolean next(Writable key, Writable value)\n throws IOException {\n /* We don't want to read past the end of the original compressed\n * string, since partitioning is done on the compressed file.\n * Therefore, we check the stream position, not the number of\n * expanded characters.\n */\n if (is.getPos() > end)\n return false;\n\n String line = readLine(in);\n if (line == null)\n return false;\n \n ((LongWritable)key).set(deflate_position); // key is position\n ((Text)value).set(line); // value is line\n //value = new PlainText(line);\n return true;\n }\n \n public synchronized long getPos() throws IOException {\n return deflate_position;\n }\n\n public synchronized void close() throws IOException { in.close(); }\n\n private String readLine(BufferedReader in) throws IOException {\n String str = in.readLine();\n if (str == null)\n return null;\n deflate_position += str.length() + 1; // +1 for newline // bug: assumes single-char line-endings\n return str;\n }\n\n };\n }\n}\n\n}}}\n