A summary of basic text processing with the grep command - 京橋のバイオインフォマティシャンの日常

Introduction
grep command basics
Working with the names of files that do or do not contain a string
Summary
- References

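Introduction

Apple's website describes the grep command as a tool for finding strings in files, but once you get to know grep, you can do much more than just "search for strings": it supports a range of more advanced operations.

This article summarizes basic text processing with the macOS version of grep.

Preparation

Create a working folder and move into it.

mkdir TEST     # create the TEST folder
cd TEST        # move into the TEST folder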

which grep     # check the path to grep
#/usr/bin/grep

Download the sample text (test.txt) from GitHub.

wget https://raw.githubusercontent.com/kumeS/Blog/master/200503_Files_for_grep_command/test.txt

For how to install wget, see my earlier article.

skume.hatenablog.com

First, display the contents of test.txt:

cat test.txt

#a
#b
#b
#c
#c
#c
#abc
#cba
#acb
#abcd

As shown, each line holds a short string such as a, b, c, or abc.

The single-character lines b and c appear two and three times, respectively.
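If wget is not available, the sample file can also be recreated locally. The following is a minimal sketch, assuming test.txt consists of exactly the ten lines shown by cat above (the downloaded file may differ in details such as the trailing newline):

```shell
# Assumption: test.txt holds exactly the ten lines printed by cat above.
# printf '%s\n' writes each argument on its own line.
printf '%s\n' a b b c c c abc cba acb abcd > test.txt

# Confirm how many lines were written.
wc -l < test.txt
```

Running cat test.txt afterwards should show the same ten lines as the listing above.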

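grep command basics

First, here are commands for working with strings inside a file.

A grep command that extracts the lines containing "a" and saves them to a separate file

grep "a" test.txt > test01.txt   
# or: grep -e "a" test.txt > test01.txt also works
# or: grep a test.txt > test01.txt also works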

cat  test01.txt
#a
#abc
#cba
#acb
#abcd
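When checking a result like this, it can help to see where the matches are and how many there are. The sketch below uses the standard -n (prefix each match with its line number) and -c (print only the count of matching lines) options. The sample() helper is a hypothetical stand-in that prints the ten sample lines to standard input, so the example does not touch any files:

```shell
# Hypothetical helper: prints the same ten lines as test.txt.
sample() { printf '%s\n' a b b c c c abc cba acb abcd; }

# -n prefixes every matching line with its line number.
numbered=$(sample | grep -n "a")
printf '%s\n' "$numbered"

# -c prints only the number of matching lines.
count=$(sample | grep -c "a")
printf '%s\n' "$count"
```

With a real file, the same options work as grep -n "a" test.txt and grep -c "a" test.txt.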

A grep command that extracts the lines that do not contain "a" and saves them to a separate file

grep -v "a" test.txt > test02.txt

cat  test02.txt
#b
#b
#c
#c
#c

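Here the -v (--invert-match) option is used.

Written on one line, using ; (semicolon) or | (pipe), the above becomes:

grep -v "a" test.txt > test02.txt ; cat test02.txt

# To display only (grep already prints to standard output, so the pipe to cat is optional)
grep -v "a" test.txt | cat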
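The -v option can also be combined with several -e patterns, in which case only the lines matching none of the patterns are kept. A minimal sketch, again using a hypothetical sample() helper in place of test.txt:

```shell
# Hypothetical helper: prints the same ten lines as test.txt.
sample() { printf '%s\n' a b b c c c abc cba acb abcd; }

# Keep only the lines that contain neither "a" nor "b".
neither=$(sample | grep -v -e "a" -e "b")
printf '%s\n' "$neither"
```

On the sample data, only the three single-character c lines remain.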

A grep command that extracts the lines starting with "a" and saves them to a separate file

grep "^a" test.txt > test03.txt ; cat  test03.txt

#a
#abc
#acb
#abcd

A grep command that extracts the lines ending with "a" and saves them to a separate file

grep "a$" test.txt > test04.txt ; cat  test04.txt
#a
#cba

A grep command that extracts the lines consisting only of "a" and saves them to a separate file

grep "^a$" test.txt > test05.txt ; cat  test05.txt
#a
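As an alternative to anchoring with ^ and $, the standard -x option requires the pattern to match the whole line. A short sketch, with a hypothetical sample() helper standing in for test.txt:

```shell
# Hypothetical helper: prints the same ten lines as test.txt.
sample() { printf '%s\n' a b b c c c abc cba acb abcd; }

# -x matches only lines that are exactly "a", like the pattern "^a$".
whole=$(sample | grep -x "a")
printf '%s\n' "$whole"

# Same result with explicit anchors, for comparison.
anchored=$(sample | grep "^a$")
printf '%s\n' "$anchored"
```

Both commands select the same single line, so which form to use is mostly a matter of readability.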

(AND search) A grep command that extracts the lines that are "two or more characters long" and "end with c", and saves them to a separate file

grep ".." test.txt | grep "c$" > test06.txt ; cat  test06.txt
#abc

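Each dot matches any single character and the pattern is not anchored, so ".." matches any line with two or more characters. Likewise, "....." (five dots) matches lines with five or more characters.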
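The same AND condition can also be written as a single pattern, and with extended regular expressions (-E) a repeat count can replace a run of dots. This is a sketch of those two alternatives, not the method used above; sample() is a hypothetical helper that prints the sample lines:

```shell
# Hypothetical helper: prints the same ten lines as test.txt.
sample() { printf '%s\n' a b b c c c abc cba acb abcd; }

# ".c$" needs at least one character before the final c,
# so it selects lines of two or more characters that end with c.
two_plus_c=$(sample | grep ".c$")
printf '%s\n' "$two_plus_c"

# With -E, ".{4}" is equivalent to "....": four or more characters.
four_plus=$(sample | grep -E ".{4}")
printf '%s\n' "$four_plus"
```

The piped version is often easier to read when the conditions are unrelated, while a single pattern avoids starting a second grep process.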

(OR search) A grep command that extracts the lines that "end with bc" or "end with cd", and saves them to a separate file

grep -e "bc$" -e "cd$" test.txt > test07.txt ; cat  test07.txt
#abc
#abcd
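An OR condition like this can also be expressed as one extended regular expression using alternation. A minimal sketch with -E, where sample() is a hypothetical helper that prints the sample lines:

```shell
# Hypothetical helper: prints the same ten lines as test.txt.
sample() { printf '%s\n' a b b c c c abc cba acb abcd; }

# (bc|cd)$ matches lines that end with either bc or cd.
either=$(sample | grep -E "(bc|cd)$")
printf '%s\n' "$either"
```

Repeating -e keeps each pattern separate, whereas the alternation form lets the two branches share the $ anchor.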

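Working with the names of files that do or do not contain a string

A grep command that displays the names of the .txt files in the directory that contain a line consisting only of "a"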

grep -l "^a$" *.txt | cat
#test.txt
#test01.txt
#test03.txt
#test04.txt
#test05.txt

A grep command that displays the names of the .txt files in the directory that do not contain a line consisting only of "a"

grep -L "^a$" *.txt | cat
#test02.txt
#test06.txt
#test07.txt

A grep command that counts the .txt files in the directory that contain a line consisting only of "a"

grep -l "^a$" *.txt | wc -l
#       5
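The count above is padded with spaces because that is how wc prints on macOS. The sketch below strips that padding with tr, and also shows -c, which reports a per-file count of matching lines instead of a count of files. It runs in a throwaway directory with hypothetical file names (one.txt, two.txt, three.txt) so the TEST folder is left untouched:

```shell
# Work in a temporary directory; the file names below are illustrative only.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/grepdemo.XXXXXX")
cd "$workdir" || exit 1
printf '%s\n' a b abc > one.txt     # one line that is exactly "a"
printf '%s\n' b c > two.txt         # no such line
printf '%s\n' a a cba > three.txt   # two such lines

# -c prints "file:count" for every file, including files with zero matches.
counts=$(grep -c "^a$" *.txt)
printf '%s\n' "$counts"

# Count the files listed by -l and strip the padding added by wc.
nfiles=$(grep -l "^a$" *.txt | wc -l | tr -d ' ')
printf '%s\n' "$nfiles"
```

Note that -c counts matching lines per file, so it answers a different question from grep -l piped into wc -l.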

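Summary

For now, being able to do this much text processing should be enough.

Next time, as a practical follow-up, I would like to present a case that handles gigabyte (GB)-sized text data.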