Download, unpack and move
Today (2022-01-06) the latest available version of Apache Pig is 0.17.0 (released in June 2017), and you can get it from the project's webpage or directly from the download subpage.
After verifying the integrity of your archive:
    nosql@nosql:~/Pulpit/nosql2_install/pig$ pwd
    /home/nosql/Pulpit/nosql2_install/pig
    nosql@nosql:~/Pulpit/nosql2_install/pig$ md5sum pig-0.17.0.tar.gz
    da76998409fe88717b970b45678e00d4 pig-0.17.0.tar.gz
    nosql@nosql:~/Pulpit/nosql2_install/pig$ cat pig-0.17.0.tar.gz.md5
    da76998409fe88717b970b45678e00d4 pig-0.17.0.tar.gz
you can extract it:
    nosql@nosql:~/Pulpit/nosql2_install/pig$ tar -zxvf pig-0.17.0.tar.gz
    [... CUT ...]
    nosql@nosql:~/Pulpit/nosql2_install/pig$ ls -l
    razem 225216
    drwxr-xr-x 16 nosql hadoopuser 4096 cze 2 2017 pig-0.17.0
    -rw-r--r-- 1 nosql hadoopuser 230606579 sty 6 16:12 pig-0.17.0.tar.gz
    -rw-r--r-- 1 nosql hadoopuser 52 sty 6 16:15 pig-0.17.0.tar.gz.md5
and move it to the destination directory /usr/lib/:
    nosql@nosql:~/Pulpit/nosql2_install/pig$ sudo mv pig-0.17.0 /usr/lib/pig
    [sudo] hasło użytkownika nosql:
    nosql@nosql:~/Pulpit/nosql2_install/pig$ ls -l /usr/lib | grep pig
    drwxr-xr-x 16 nosql hadoopuser 4096 cze 2 2017 pig
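The checksum comparison at the start of this section was done by eye, but it can also be scripted. The following is only a minimal sketch, assuming GNU coreutils `md5sum` and a `.md5` file whose first field is the hash, as in the session above. The function name `verify_md5` is a hypothetical helper, not part of Pig.

```shell
#!/usr/bin/env bash
# verify_md5 is a hypothetical helper (not part of Pig or coreutils).
# It compares the hash computed by md5sum with the first field of the
# published .md5 file and returns 0 only when both are equal.
verify_md5() {
    local archive="$1" md5_file="$2"
    local expected actual
    # First whitespace-separated field of the first line is the hash.
    expected=$(awk '{print $1; exit}' "$md5_file")
    actual=$(md5sum "$archive" | awk '{print $1}')
    # An empty expected value (missing or empty .md5 file) counts as failure.
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Example use (file names taken from the session above):
#   verify_md5 pig-0.17.0.tar.gz pig-0.17.0.tar.gz.md5 && tar -zxf pig-0.17.0.tar.gz
```

Chaining the extraction with `&&` means the archive is unpacked only when the two hashes match.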
Setting up the environment variable
Open ~/.bashrc in a text editor:
    nosql@nosql:~/Pulpit/nosql2$ nano ~/.bashrc
and paste at the end:
    #Pig
    export PIG_HOME=/usr/lib/pig
    export PATH=$PATH:$PIG_HOME/bin
After saving changes, activate the environment variables with the following command:
    nosql@nosql:~/Pulpit/nosql2$ source ~/.bashrc
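If you repeat the installation, pasting the same lines again leaves duplicate entries in ~/.bashrc. As an alternative to editing by hand, the sketch below appends a line only when it is not already present. The helper name `append_once` is hypothetical, and the sketch assumes the file ends with a newline.

```shell
#!/usr/bin/env bash
# append_once is a hypothetical helper: it appends a line to a file only
# if the file does not already contain exactly that line.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example use with the values from this section (single quotes keep
# $PATH and $PIG_HOME unexpanded until .bashrc is sourced):
#   append_once 'export PIG_HOME=/usr/lib/pig' ~/.bashrc
#   append_once 'export PATH=$PATH:$PIG_HOME/bin' ~/.bashrc
```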
Add path to Java
This step is optional: whether you need it depends on your previous installations.
Locate your Java:
    nosql@nosql:~/Pulpit/nosql2_install/pig$ which javac
    /usr/bin/javac
    nosql@nosql:~/Pulpit/nosql2_install/pig$ readlink -f /usr/bin/javac
    /usr/lib/jvm/java-11-openjdk-amd64/bin/javac
Open ~/.bashrc in a text editor:
    nosql@nosql:~/Pulpit/nosql2_install/pig$ nano ~/.bashrc
paste at the end:
    #Java
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
save the file and activate the environment variables with the following command:
    nosql@nosql:~/Pulpit/nosql2_install/pig$ source ~/.bashrc
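The JAVA_HOME value used above is the resolved javac path with the trailing /bin/javac removed. A minimal sketch of that derivation, assuming GNU `readlink -f` is available. The helper name `java_home_from` is hypothetical.

```shell
#!/usr/bin/env bash
# java_home_from is a hypothetical helper: it resolves symlinks with
# readlink -f (GNU coreutils) and strips the last two path components,
# for example .../bin/javac, to get the JDK directory.
java_home_from() {
    local resolved
    resolved=$(readlink -f "$1") || return 1
    dirname "$(dirname "$resolved")"
}

# Example use:
#   export JAVA_HOME="$(java_home_from "$(which javac)")"
```

With the paths from this section, /usr/bin/javac resolves to /usr/lib/jvm/java-11-openjdk-amd64/bin/javac, so the function would print /usr/lib/jvm/java-11-openjdk-amd64.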
Verify Pig
If you want, you can create a directory for all your Pig tests and jobs. In my case this is /home/nosql/Pulpit/nosql2/pig:
    nosql@nosql:~/Pulpit/nosql2/pig$ pig -h
    Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
    USAGE: Pig [options] [-] : Run interactively in grunt shell.
           Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
           Pig [options] [-f[ile]] file : Run cmds found in file.
      options include:
        -4, -log4jconf - Log4j configuration file, overrides log conf
        -b, -brief - Brief logging (no timestamps)
        -c, -check - Syntax check
    [... CUT ...]
    2022-01-06 16:56:47,917 INFO pig.Main: Pig script completed in 97 milliseconds (97 ms)
Great! Pig is ready to work. Now you can run a basic test.
Create a simple test file:
    nosql@nosql:~/Pulpit/nosql2/pig$ echo '1,2,3
    > 4,5,6
    > 7,8,9' > test.txt
    nosql@nosql:~/Pulpit/nosql2/pig$ cat test.txt
    1,2,3
    4,5,6
    7,8,9
Load and dump test data:
    nosql@nosql:~/Pulpit/nosql2/pig$ pig -x local
    2022-01-06 17:00:11,282 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
    2022-01-06 17:00:11,282 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
    2022-01-06 17:00:11,393 [main] INFO org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
    2022-01-06 17:00:11,393 [main] INFO org.apache.pig.Main - Logging error messages to: /home/nosql/Pulpit/nosql2/pig/pig_1641484811384.log
    2022-01-06 17:00:11,445 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/nosql/.pigbootup not found
    2022-01-06 17:00:11,622 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2022-01-06 17:00:11,625 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
    2022-01-06 17:00:11,888 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2022-01-06 17:00:11,915 [main] INFO org.apache.pig.PigServer - Pig Script ID for the session: PIG-default-1da3ce9f-b596-4dee-a61f-2f1121be8912
    2022-01-06 17:00:11,917 [main] WARN org.apache.pig.PigServer - ATS is disabled since yarn.timeline-service.enabled set to false
    grunt> A = LOAD 'test.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    2022-01-06 17:00:42,471 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    grunt> DUMP A;
    2022-01-06 17:01:01,840 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2022-01-06 17:01:01,874 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
    2022-01-06 17:01:01,920 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2022-01-06 17:01:01,969 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    2022-01-06 17:01:02,077 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (Tenured Gen) of size 699072512 to monitor. collectionUsageThreshold = 489350752, usageThreshold = 489350752
    2022-01-06 17:01:02,135 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
    2022-01-06 17:01:02,174 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
    2022-01-06 17:01:02,174 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
    2022-01-06 17:01:02,214 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2022-01-06 17:01:02,313 [main] INFO org.apache.hadoop.metrics2.impl.MetricsConfig - Loaded properties from hadoop-metrics2.properties
    2022-01-06 17:01:02,498 [main] INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl - Scheduled Metric snapshot period at 10 second(s).
    2022-01-06 17:01:02,498 [main] INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system started
    2022-01-06 17:01:02,561 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
    2022-01-06 17:01:02,579 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
    2022-01-06 17:01:02,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2022-01-06 17:01:02,585 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
    2022-01-06 17:01:02,613 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
    2022-01-06 17:01:02,663 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
    2022-01-06 17:01:02,663 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
    2022-01-06 17:01:02,664 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Distributed cache not supported or needed in local mode. Setting key [pig.schematuple.local.dir] with code temp directory: /tmp/1641484862663-0
    2022-01-06 17:01:02,747 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
    2022-01-06 17:01:02,771 [JobControl] WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
    2022-01-06 17:01:02,803 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
    2022-01-06 17:01:02,914 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
    2022-01-06 17:01:02,950 [JobControl] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
    2022-01-06 17:01:02,965 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
    2022-01-06 17:01:02,971 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    2022-01-06 17:01:02,997 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
    2022-01-06 17:01:03,057 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
    2022-01-06 17:01:03,408 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1024630583_0001
    2022-01-06 17:01:03,408 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Executing with tokens: []
    2022-01-06 17:01:03,659 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/
    2022-01-06 17:01:03,662 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1024630583_0001
    2022-01-06 17:01:03,662 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases A
    2022-01-06 17:01:03,662 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: A[1,4],A[-1,-1] C: R:
    2022-01-06 17:01:03,665 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
    2022-01-06 17:01:03,697 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2022-01-06 17:01:03,698 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_local1024630583_0001]
    2022-01-06 17:01:03,732 [Thread-6] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2022-01-06 17:01:03,732 [Thread-6] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2022-01-06 17:01:03,732 [Thread-6] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
    2022-01-06 17:01:03,739 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
    2022-01-06 17:01:03,739 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    2022-01-06 17:01:03,740 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter is org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter
    2022-01-06 17:01:03,826 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner - Waiting for map tasks
    2022-01-06 17:01:03,832 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local1024630583_0001_m_000000_0
    2022-01-06 17:01:03,904 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
    2022-01-06 17:01:03,904 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    2022-01-06 17:01:03,956 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorProcessTree : [ ]
    2022-01-06 17:01:03,973 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Processing split: Number of splits :1
    Total Length = 18
    Input split[0]:
       Length = 18
       ClassName: org.apache.hadoop.mapreduce.lib.input.FileSplit
       Locations:
    -----------------------
    2022-01-06 17:01:04,002 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.builtin.PigStorage - Using PigTextInputFormat
    2022-01-06 17:01:04,011 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/home/nosql/Pulpit/nosql2/pig/test.txt:0+18
    2022-01-06 17:01:04,021 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 2
    2022-01-06 17:01:04,021 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    2022-01-06 17:01:04,055 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (Tenured Gen) of size 699072512 to monitor. collectionUsageThreshold = 489350752, usageThreshold = 489350752
    2022-01-06 17:01:04,061 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
    2022-01-06 17:01:04,079 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: A[1,4],A[-1,-1] C: R:
    2022-01-06 17:01:04,096 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -
    2022-01-06 17:01:04,117 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task:attempt_local1024630583_0001_m_000000_0 is done. And is in the process of committing
    2022-01-06 17:01:04,130 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner -
    2022-01-06 17:01:04,130 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task attempt_local1024630583_0001_m_000000_0 is allowed to commit now
    2022-01-06 17:01:04,138 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local1024630583_0001_m_000000_0' to file:/tmp/temp-1316990218/tmp446550113
    2022-01-06 17:01:04,140 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner - map
    2022-01-06 17:01:04,140 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local1024630583_0001_m_000000_0' done.
    2022-01-06 17:01:04,145 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Final Counters for attempt_local1024630583_0001_m_000000_0: Counters: 15
        File System Counters
            FILE: Number of bytes read=442
            FILE: Number of bytes written=592271
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
        Map-Reduce Framework
            Map input records=3
            Map output records=3
            Input split bytes=370
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=5
            Total committed heap usage (bytes)=63381504
        File Input Format Counters
            Bytes Read=0
        File Output Format Counters
            Bytes Written=0
    2022-01-06 17:01:04,146 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local1024630583_0001_m_000000_0
    2022-01-06 17:01:04,147 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
    2022-01-06 17:01:04,202 [main] WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
    2022-01-06 17:01:04,212 [main] WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
    2022-01-06 17:01:04,214 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
    2022-01-06 17:01:04,214 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    2022-01-06 17:01:04,215 [main] WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
    2022-01-06 17:01:04,245 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    2022-01-06 17:01:04,247 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    3.3.1 0.17.0 nosql 2022-01-06 17:01:02 2022-01-06 17:01:04 UNKNOWN

    Success!

    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_local1024630583_0001 1 0 n/a n/a n/a n/a 0 0 0 0 A MAP_ONLY file:/tmp/temp-1316990218/tmp446550113,

    Input(s):
    Successfully read 3 records from: "file:///home/nosql/Pulpit/nosql2/pig/test.txt"

    Output(s):
    Successfully stored 3 records in: "file:/tmp/temp-1316990218/tmp446550113"

    Counters:
    Total records written : 3
    Total bytes written : 0
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:
    job_local1024630583_0001

    2022-01-06 17:01:04,250 [main] WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
    2022-01-06 17:01:04,252 [main] WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
    2022-01-06 17:01:04,254 [main] WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl - JobTracker metrics system already initialized!
    2022-01-06 17:01:04,273 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2022-01-06 17:01:04,277 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    2022-01-06 17:01:04,278 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
    2022-01-06 17:01:04,285 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
    2022-01-06 17:01:04,285 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (1,2,3)
    (4,5,6)
    (7,8,9)
    grunt> quit
    2022-01-06 17:10:46,123 [main] INFO org.apache.pig.Main - Pig script completed in 10 minutes, 35 seconds and 167 milliseconds (635167 ms)
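The same two statements can also be run without the interactive grunt shell, since the usage text shown earlier lists `Pig [options] [-f[ile]] file` as a way to run commands from a file. The sketch below only writes such a script. The helper name `write_test_script` and the file name `test.pig` are hypothetical.

```shell
#!/usr/bin/env bash
# write_test_script is a hypothetical helper: it writes the two Pig Latin
# statements used in the grunt session above into the given file.
write_test_script() {
    # The quoted 'EOF' delimiter keeps the heredoc body literal.
    cat > "$1" <<'EOF'
A = LOAD 'test.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DUMP A;
EOF
}

# Example use (run from the directory that contains test.txt):
#   write_test_script test.pig
#   pig -x local test.pig
```

Keeping the statements in a file makes the test repeatable, which is handy when you want to recheck the installation later.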