Pig Latin - Query language

Pig Latin is an open source query language created in 2008.

#213 on PLDB

17 years old

1k repos

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.

Tags: queryLanguage
There are at least 1,347 Pig Latin repos on GitHub
Early development of Pig Latin happened at the Apache Software Foundation
The Google BigQuery Public Dataset GitHub snapshot shows 535 users using Pig Latin in 606 repos on GitHub
A CodeMirror package provides syntax highlighting for Pig Latin
Pygments supports syntax highlighting for Pig Latin
GitHub supports syntax highlighting for Pig Latin
See also: (8 related languages) Linux, Java, SQL, Python, JavaScript, Ruby, Groovy, Sawzall
3 PLDB concepts link to Pig Latin: Ace Editor, cloc, Pygments

Example from the web:

input_lines = LOAD '/tmp/word.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/results.txt';

Example from hello-world:

Hello WorldPIGHello World

Example from Linguist:

/**
 * sample.pig
 */

REGISTER $SOME_JAR;

A = LOAD 'person' USING PigStorage() AS (name:chararray, age:int); -- Load person
B = FOREACH A generate name;
DUMP B;

Example from Wikipedia:

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
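
For readers unfamiliar with Pig's dataflow style, the word-count script above can be read as a chain of ordinary collection operations. The following is a rough in-memory Python analogue, not how Pig executes the job on Hadoop. The function name `word_count` and the sample input are hypothetical, and TOKENIZE is approximated by whitespace splitting, which is a simplification.

```python
import re
from collections import Counter

def word_count(lines):
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row.
    # Assumption: whitespace splitting stands in for TOKENIZE.
    words = [w for line in lines for w in line.split()]

    # FILTER ... BY word MATCHES '\\w+': MATCHES tests the whole string,
    # so fullmatch is used; re.ASCII mirrors Java's default \w.
    filtered = [w for w in words if re.fullmatch(r"\w+", w, flags=re.ASCII)]

    # GROUP ... BY word, then COUNT the entries of each group.
    counts = Counter(filtered)

    # ORDER ... BY count DESC, emitting (count, word) rows like the script.
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda row: row[0], reverse=True)

print(word_count(["a b a", "b a c!"]))
```

The token `c!` is dropped by the filter step because it is not made up entirely of word characters.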

Language features

Feature	Supported	Example	Token
Integers	✓	-- [0-9]+L?
Floats	✓	-- [0-9]*\.[0-9]+(e[0-9]+)?[fd]?
Hexadecimals	✓	-- 0x[0-9a-f]+
MultiLine Comments	✓	/* A comment */	/* */
Comments	✓	-- A comment
Line Comments	✓	-- A comment	--
Semantic Indentation	X
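
The numeric token patterns in the table can be tried out directly. The sketch below copies the three regular expressions from the table as they are and checks a few sample literals against them in Python. The labels, the `classify` helper, and the sample strings are hypothetical, and the patterns are applied case-sensitively, which is an assumption about how the original lexer uses them.

```python
import re

# Patterns copied from the feature table; hexadecimal is tried first so
# that a literal such as "0x1f" is not rejected by the integer pattern.
TOKEN_PATTERNS = {
    "hexadecimal": r"0x[0-9a-f]+",
    "float": r"[0-9]*\.[0-9]+(e[0-9]+)?[fd]?",
    "integer": r"[0-9]+L?",
}

def classify(literal):
    """Return the first token kind whose pattern matches the whole literal."""
    for kind, pattern in TOKEN_PATTERNS.items():
        if re.fullmatch(pattern, literal):
            return kind
    return None

for sample in ["42", "42L", "3.14", ".5e3f", "0x1f", "abc"]:
    print(sample, "->", classify(sample))
```

Note that the float pattern requires a decimal point, so a literal such as `1e5` would not be classified as a float by these patterns alone.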