This perl script (works with perl 4 or perl 5) checks data files.
Prints part of file for examination.
Assumes datafile has newlines, unless -l flag used.
Report goes to stdout.

Reports:
  - minimum and maximum record lengths;
  - number of records of each record length;
  - for records of unique record length reports the record number;
  - "bad" characters found.

By default, prints first 70 columns of first 5 and last 5 records.
By default reads filenames on command line.

Flags:
  -c#  "count": print # data lines at beginning and end of report
  -f#  force cio to process # of "bad" records
  -h   prints this help screen.
  -l#  give lrecl for files with no newlines.
  -m   show man page.
  -n#  prints 1st 5 lines and every nth record.
  -q   quick mode: prints only 5 lines, (opt.: first nth record) and rpt.
  -s   take input from stdin
----------------------------------------------------------------------
#!/opt/perl5/bin/perl
# perl4 #!/usr/local/bin/perl
# perl5.001 #!/usr/local/bin/perl5
# perl5.004_1 #!/opt/perl5/bin/perl
'di';
'ig00';
#
# $Header$
#
# $Log$

$using_wrapman = 1;

######################################################################
# "cio" (Check It Out)
# Date: 95/07/24
# Author: jajacobs
#------------------------------------------------------------------
# Jim Jacobs
# Central University Library, Mail Code 0175-R
# Social Sciences Database Project
# University of California, San Diego
# San Diego, California 92122
# (619) 534-1262
#------------------------------------------------------------------
# This version reads filenames from command line and reads from
# stdin when -s flag is used. It reports "bad" characters, tabs, and
# alphabetics, too.
# This version has -c "count" option that specifies how many lines to
# print at beginning and end.
#
# This version works with
#   perl 4.0 patch level 35
#   perl 4.0 patch level 36
#   perl 5.000
#   perl 5.001
######################################################################

######################################################################
# set up vars.
# command line options:
#   c#  "count" show # lines at beginning and end
#   f#  force cio to process # of "bad" records
#   h   show help screen
#   l   logical record length (no newlines)
#   m   show man page
#   n#  show every #th record
#   q   quick option: only first 5 and last 5 recs examined
#   s   take input from stdin
######################################################################
require "getopts.pl";
&Getopts('n:l:f:c:e:hqms') ;

if ($#ARGV == 1) {
    print "$#ARGV\n";
    &help;
}
if ($opt_h) { &help; }
if ($opt_m) { &Usage(); }
if ($opt_c) {
    $end = $opt_c + 1 ;
    $count = $opt_c;
} else {
    $end = 6;
    $count = 5;
}
if ($opt_f) {
    $max_bad = $opt_f;
} else {
    $max_bad = 5;
}
$width = 75;
$rec = 0;
$maxrec = 0;
$minrec = 99999;
$n = 1;
$f = 1;
$a = 0;

######################################################################
# process...
# if $opt_s then we're taking input from stdin
######################################################################
if ($opt_s) {
    ######################################################################
    # if $opt_l then we're looking at a lrecl file with no newlines
    ######################################################################
    if ($opt_l) {
        print "Using logical record length of $opt_l\n";
        &print_header;
        while (read(STDIN, $tmp, $opt_l)) {
            $rec++;
            if ($opt_q) {
                &quick_opt;
            } else {
                &call_bad;
            }
        } # end of reading all recs from stdin lrcl
        &call_last;
    } #end of lrecl from stdin loop
    ######################################################################
    # if not $opt_l then we're looking at a newline-delimited file
    ######################################################################
    else {
        print "Reading file with newlines.\n" ;
        print "Record lengths reported do NOT include the newline character.\n";
        print "------------------------------------------------\n";
        &print_header;
        while ($tmp = <STDIN>) {
            chop $tmp;
            $rec++;
            if ($opt_q) {
                &quick_opt;
            } else {
                &call_bad;
            }
        } # end of while tmp = stdin
        &call_last;
    } # end of newline processing from stdin
} # end of opt_s
######################################################################
# if not $opt_s then we're reading filenames from command line
######################################################################
else {
    while (@ARGV) {
        $file = shift @ARGV;
        if (-e $file) {
            if (open (FILE, "$file") ) {
                print "EXAMINING FILE: $file ######################################################################\n";
                if ($opt_l) {
                    print "Using logical record length of $opt_l\n";
                    &print_header;
                    while (read(FILE, $tmp, $opt_l)) {
                        $rec++;
                        if ($opt_q) {
                            &quick_opt;
                        } else { # not opt_q
                            &call_bad;
                        } # end of not opt_q else loop
                    } # end of while read FILE
                    &call_last;
                ##############################################################
                # if not opt_l then we're looking at a newline delimited file.
                ##############################################################
                }else {
                    print "Reading file with newlines.\n" ;
                    print "Record lengths reported do NOT include the newline character.\n";
                    print "------------------------------------------------\n";
                    &print_header;
                    while ($tmp = <FILE>) {
                        chop $tmp;
                        $rec++;
                        if ($opt_q) {
                            &quick_opt;
                        } else { # end of opt_q loop
                            &call_bad;
                        } # end of reading recs from FILE with newline
                    } # end of while tmp = FILE
                    &call_last;
                } # end of else loop not opt_l reading newline files
            } else {
                print STDERR "can't open file $file.\n";
            }
        }else {
            print STDERR "File $file does not exist.\n"
        }
    } # end of while ARGV
} # end of reading file from command line

######################################################################
sub print_header {
    $~ = "STUFF";
    print "                10        20        30        40        50        60 \n";
    print "------ +----*----+----*----+----*----+----*----+----*----+----*----+----*----+\n";
}

######################################################################
# sub for quick option
######################################################################
sub quick_opt {
    if ($rec == ($end + $opt_n) ) {
        &tail_loop;
    } else {
        &call_bad;
    }
}

######################################################################
# sub bad_add called by sub bad_chars
######################################################################
sub bad_add {
    $one_bad = 1;
    $bad_n++;
    $bad_char_rec[$bad_count] = $tmp;
    $stuff = unpack (C,$&);
    $bad_rec[$bad_n] = $rec;
    $bad_char[$bad_n] = $stuff ;
} # end of sub bad_add

######################################################################
# subroutine for sorting.
###################################################################### sub sortlengths { local ($x1, $y1) = split(/:/, $a); local ($x2, $y2) = split(/:/, $b); $y1 <=> $y2; } ###################################################################### # sub to read call bad chars or bail ###################################################################### sub call_bad { &length; if ($bad_count < $max_bad) { &bad_chars; } elsif ($e_bail) { &end_rpt_e_bail; } else { &end_rpt_bail; } } ###################################################################### # sub to do last recs or bail ###################################################################### sub call_last { if ($bail){ $bail = 0; } else { &last_recs; &end_rpt; } } ###################################################################### # sub for writing in opt_q. ###################################################################### sub q_write { $rec++; &call_bad; $LLine = substr($tmp, 0,($width -5)); if (length $tmp > ($width - 5)) { $LLine = $LLine . '>'; } else { $LLine = $LLine . '|'; } if (!$bail) { write; } } ###################################################################### # if too many bad records encountered... bail... ###################################################################### sub end_rpt_bail { $bail = 1; print " > -------------------------------------------------- > cio ceased processing because it encountered $bad_count > records with bad characters. (See list of bad > characters and where they were found in the report > below.) If you wish to reanalyze the data and let > cio process more records, use the -f flag to specify > how many bad records to process before ceasing. > ---------------------------------------------- \n"; if ($file) { close FILE; } else { close STDIN; } &end_rpt; } ###################################################################### # if too many ebcdics bail... 
######################################################################
sub end_rpt_e_bail {
    print "
> --------------------------------------------------
> cio ceased processing after $rec records because
> $e_percent% of the $total characters analyzed appear to be
> EBCDIC, not ASCII. If you wish to reanalyze the
> data and let cio process more records, use the -f
> flag to specify how many bad records to process
> before ceasing.
> ----------------------------------------------
\n";
    if ($file) {
        close FILE;
    } else {
        close STDIN;
    }
    &end_rpt;
}

######################################################################
# do end report
######################################################################
sub end_rpt {
    ##################################################
    # if we bailed because we were using opt_l
    # and we read more than 32768 bytes without encountering
    # newline char, skip the ending printing...
    # call this the no_newline exclusion
    ##################################################
    if (!$no_newline) {
        $~ = 'STDOUT';
        ##################################################
        # print bottom ruler, min and max rec lengths
        # and number of records checked or found
        ##################################################
        print "------ +----*----+----*----+----*----+----*----+----*----+----*----+----*----+\n";
        print "                10        20        30        40        50        60 \n";
        print "\nMax rec len: ", $maxrec, "\n";
        print "Min rec len: ", $minrec, "\n";
        if ($opt_q||$bail||$e_bail) {
            print "Total number of records checked: ", $rec, "\n";
        } else {
            print "Total number of records: ", $rec, "\n" ;
        }
        print "\n";
        ######################################################################
        # make a new array @l to store both the length of records and
        # the number of records of each length.
###################################################################### $i = 0; foreach $x (keys %length) { $l[$i++] = "$x:$length{$x}"; } ###################################################################### # sort the records/length array @l to array @sorted # use reverse sort so that final array will print with most frequent # record size first, least frequent record size last. ###################################################################### @sorted = reverse sort sortlengths @l ; ###################################################################### # split the @sorted array to get its separate values for the # record size and numbers of records of that size. # # write the records-sizes and count of record-sizes. # use stdout format to print counts of recs of each size # use ONE format to write if $s_count == 1. ###################################################################### $i = 0; while ( $i < $a) { ($s_length, $s_count) = split(/:/,$sorted[$i]); if ($s_count == 1) { $~ = ONE ; write ; $sorted[$i] = ""; $l[$i] = ""; $i++; } else { $~ = STDOUT; write STDOUT; $sorted[$i] = ""; $l[$i] = ""; $i++; } } ############################################################ # print notes on lowercase, uppercase and tabs if found ############################################################ print "\n"; if ($lowercase) { print "NOTE: Data has lowercase characters beginning with record: $lowercase_rec:\n----\n$lowercase_\n----\n"; } if ($uppercase) { print "NOTE: Data has uppercase characters beginning with record: $uppercase_rec:\n----\n$uppercase_\n----\n"; } if ($tab) { print "NOTE: This file has tabs beginning in record $tab_rec: \n---\n$tab_\n---\n"; } $g = 1; ############################################################ # print report on bad chars found, if any ############################################################ if ($e_count) { if ($e_bail) { ; } else { if ($total) { $e_percent = ($e_count / $total) * 100; if ($e_percent >= 90) { $bail = 1; print 
"########### NOTE: $e_percent% of the $total characters examined appear to be EBCDIC, not ASCII. If you wish to re-analyze this file and let cio process more records, use the -f flag to force cio to analze this file anyway. Use the -f flag to specify how many records to analyze before ceasing.\n"; &clean_up } } } } # end of e_count option if ($bad_n) { while ($g < ($bad_n+1) ) { printf ("NOTE: Record %d has a bad character: (octal:%3o,dec:%3d,hex:%3x)\n", $bad_rec[$g], $bad_char[$g], $bad_char[$g],$bad_char[$g]); $bad_rec[$g] = ""; $bad_char[$g] = ""; $g++; } } ############################################################ # print message if the file looks like ebcdic ############################################################ if ($e_count) { if ($e_count == 1) { print "NOTE: out of a total of $total characters checked, there is $e_count characters that may be EBCDIC\n"; } else { print "NOTE: out of a total of $total characters checked, there are $e_count characters that may be EBCDIC\n"; } } ############################################################ # if no bad chars found, print good news ############################################################ if (!$bad_n && !$e_bail) { if ($opt_q) { print "NOTE: Good News! cio didn't find any bad characters in the $rec records it examined!\n"; } else { print "NOTE: Good News! cio didn't find any bad characters!\n"; } } ############################################################ # if we're reading files named on command line, # print that we're done with this one. ############################################################ if ($file) { print "\n###################################################################### END OF REPORT FOR FILE $file\n"; } } # end of no_newline exclusion &clean_up; } # end of sub end_rpt ############################################################################### # sub clean_up resets values in case there is another file to examine. 
############################################################################### sub clean_up { if ($opt_f) { $max_bad = $opt_f; } else { $max_bad = 5; } if ($opt_c) { $end = $opt_c + 1 ; $count = $opt_c; } else { $end = 6; $count = 5; } $width = 75; $rec = 0; $maxrec = 0; $minrec = 99999; $n = 1; $f = 1; $a = 0; $b = 0; $bad_n = 0; $bad_count = 0; $e_count = 0; $i = 0; $len = 0; $lowercase = 0; $lowercase_rec = 0; $lowercase_ = 0; $no_newline = 0; $one_bad = 0; $s_count = 0; $s_length = 0; $tab = 0; $tab_rec = 0; $tab_ = 0; $tailing = 0; $total = 0; $uppercase = 0; $uppercase_rec = 0; $uppercase_ = 0; while ($i <= $end) { $lastrec[$i] = ""; $lastline[$i] = ""; $i++; } foreach $key (keys %length) { delete $length{$key}; } foreach $key (keys %onerec) { delete $onerec{$key}; } foreach $key (keys %bad_char_rec) { delete $bad_char_rec{$key}; } } #end of sub clean_up ###################################################################### # sub for writing last records ###################################################################### sub last_recs { if (!$opt_q) { $n--; if ($n == $end) { $f = 1; } else { $f = $n +1; } if ($opt_c) { print "\nLAST $opt_c RECORDS: \n\n"; } else { print "\nLAST ", $end-1, " RECORDS:\n\n"; } ################################################################# # set up STUFF2 as output format for write. 
    #################################################################
        $~ = "STUFF2";
        $z = 1;
        while ( $z < $end) {
            write ;
            $z++;
            $f++;
            if ($f == $end) {
                $f = 1;
            }
        }
    } #end of writing last recs
} #end of last_recs sub

######################################################################
# sub tail_loop
######################################################################
sub tail_loop {
    $tailing = 1;
    $~ = "STUFF3";
    if ($opt_l) {
        $offset = $opt_l * $count;
        print "\nLAST $count RECORDS (record numbers inaccurate) (lrecls from end of file)\n";
        if ($opt_s) {
            seek (STDIN, -$offset, 2);
            while (read (STDIN, $tmp, $opt_l)) {
                &q_write;
            }
        } else { # if not opt_s
            seek (FILE, -$offset, 2);
            while (read (FILE, $tmp, $opt_l)) {
                &q_write;
            }
        }
    } else {
        $last_stuff = 0;
        $offset = ($maxrec + 1) * ($count +1);
        print "\nLAST FEW RECORDS (record numbers inaccurate)\n\n";
        if ($opt_s) {
            seek (STDIN, -$offset, 2);
            while ($tmp = <STDIN>) {
                chop $tmp;
                if ($last_stuff) {
                    &q_write;
                } else {
                    $last_stuff = 1;
                }
            }
        } else { # not opt_s
            seek (FILE, -$offset, 2);
            while ($tmp = <FILE>) {
                chop $tmp;
                if ($last_stuff) {
                    &q_write;
                } else {
                    $last_stuff = 1;
                }
            }
        }
    }
} #end of tail loop

######################################################################
# examine records for length and for printing sample recs
######################################################################
sub length {
    ############################################################
    # get length of current record
    ############################################################
    $len = length($tmp);
    if (!$opt_l&&!$opt_f) {
        if ($rec == 1) {
            if ($len > 32768) {
                print "
> --------------------------------------------------
> cio ceased processing because $len bytes were
> read without encountering a newline character.
> If the datafile does have newlines and the records
> are longer than 32,768, use the -f flag to force
> cio to analyze those longer records; (give the -f
> flag any numeric argument to make it do this).
> If the datafile does not have newline characters, > use the -l flag to specify the logical record > length of the data. > ---------------------------------------------- \n"; $bail = 1; $no_newline = 1; &end_rpt; } } } if (!$bail) { ########################################################## # check for minimum and maximum record lengths ########################################################### $minrec = $len < $minrec ? $len : $minrec; $maxrec = $len > $maxrec ? $len : $maxrec; ########################################################### # count ($length) the number of records of this length ($len) # note: associative array. ########################################################## $length{$len}++; ############################################################ # if this is the first record of this length, keep track of the record # number of this record as $onerec[$len] # Increment $a to keep track of how many unique record lengths there # are in this file. ############################################################ if ($length{$len} == 1) { if ($tailing) { $onerec[$len] = "last few" ; $a++; } else { $onerec[$len] = $rec ; $a++; } } ###################################################################### # check for 'n' value and if on first $count records. # (if the record number is not a multiple of opt_n, don't write the line.) # (if the record number is in first $count or if it is a multiple of # opt_n, write the line.) # create a short record $line for writing. # check for length of original line and choose appropriate end of line # character. ###################################################################### if ($opt_n) { $lastline[$n] = substr($tmp, 0, ($width - 5)); $lastrec[$n] = $rec; if (length $tmp > ($width - 5)) { $lastline[$n] = $lastline[$n].'>'; } else { $lastline[$n] = $lastline[$n] . 
'|'; } unless ( $rec % $opt_n) { $line = $lastline[$n]; write; } elsif ($rec <= $count) { $line = $lastline[$n]; write; } $n++; if ($n == $end) { $n = 1; } } else { $lastline[$n] = substr($tmp, 0, ($width - 5)); $lastrec[$n] = $rec; if (length $tmp > ($width - 5)) { $lastline[$n] = $lastline[$n] ; $lastline[$n] = $lastline[$n] . '>'; } else { $lastline[$n] = $lastline[$n] ; $lastline[$n] = $lastline[$n] . '|'; } if ($rec <= $count) { $line = $lastline[$n] ; write; } $n++; if ($n == $end) { $n = 1; } } } # end of if not bail } # end of sub length ###################################################################### # examine records for bad bytes. ###################################################################### sub bad_chars { $text = "has a bad character: "; $total = $total + $len ; if (!$lowercase) { if ($tmp =~ (/[a-z]/) ) { $lowercase = 1; $lowercase_rec = $rec; $lowercase_ = $tmp ; } } if (!$uppercase) { if ($tmp =~ (/[A-Z]/) ) { $uppercase = 1; $uppercase_rec = $rec; $uppercase_ = $tmp ; } } if (!$tab) { if ($tmp =~ (/\011/) ) { $tab = 1; $tab_rec = $rec; $tab_ = $tmp; } } if ($tmp =~ (/[^a-zA-Z0-9_ ]/) ) { if ($tmp =~ (/[\000-\037]/) ) { if ($tmp =~ (/\000/) ) { &bad_add;} if ($tmp =~ (/\001/) ) { &bad_add;} if ($tmp =~ (/\002/) ) { &bad_add;} if ($tmp =~ (/\003/) ) { &bad_add;} if ($tmp =~ (/\004/) ) { &bad_add;} if ($tmp =~ (/\005/) ) { &bad_add;} if ($tmp =~ (/\006/) ) { &bad_add;} if ($tmp =~ (/\007/) ) { &bad_add;} if ($tmp =~ (/\010/) ) { &bad_add;} if ($tmp =~ (/\012/) ) { &bad_add;} if ($tmp =~ (/\013/) ) { &bad_add;} if ($tmp =~ (/\014/) ) { &bad_add;} if ($tmp =~ (/\015/) ) { &bad_add;} if ($tmp =~ (/\016/) ) { &bad_add;} if ($tmp =~ (/\017/) ) { &bad_add;} if ($tmp =~ (/\020/) ) { &bad_add;} if ($tmp =~ (/\021/) ) { &bad_add;} if ($tmp =~ (/\022/) ) { &bad_add;} if ($tmp =~ (/\023/) ) { &bad_add;} if ($tmp =~ (/\024/) ) { &bad_add;} if ($tmp =~ (/\025/) ) { &bad_add;} if ($tmp =~ (/\026/) ) { &bad_add;} if ($tmp =~ (/\027/) ) { &bad_add;} if 
($tmp =~ (/\030/) ) { &bad_add;} if ($tmp =~ (/\031/) ) { &bad_add;} if ($tmp =~ (/\032/) ) { &bad_add;} if ($tmp =~ (/\033/) ) { &bad_add;} if ($tmp =~ (/\034/) ) { &bad_add;} if ($tmp =~ (/\035/) ) { &bad_add;} if ($tmp =~ (/\036/) ) { &bad_add;} if ($tmp =~ (/\037/) ) { &bad_add;} } #end of 000 to 037 loop if ($tmp =~ (/[\177-\377]/) ) { if ($tmp =~ (/\177/) ) { &bad_add;} if ($tmp =~ (/\200/) ) { &bad_add;} if ($tmp =~ (/\201/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\202/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\203/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\204/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\205/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\206/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\207/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\210/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\211/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\212/) ) { &bad_add;} if ($tmp =~ (/\213/) ) { &bad_add;} if ($tmp =~ (/\214/) ) { &bad_add;} if ($tmp =~ (/\215/) ) { &bad_add;} if ($tmp =~ (/\216/) ) { &bad_add;} if ($tmp =~ (/\217/) ) { &bad_add;} if ($tmp =~ (/\220/) ) { &bad_add;} if ($tmp =~ (/\221/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\222/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\223/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\224/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\225/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\226/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\227/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\230/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\231/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\232/) ) { &bad_add;} if ($tmp =~ (/\233/) ) { &bad_add;} if ($tmp =~ (/\234/) ) { &bad_add;} if ($tmp =~ (/\235/) ) { &bad_add;} if ($tmp =~ (/\236/) ) { &bad_add;} if ($tmp =~ (/\237/) ) { &bad_add;} if ($tmp =~ (/\240/) ) { &bad_add;} if ($tmp =~ (/\241/) ) { &bad_add;} if ($tmp =~ (/\242/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\243/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\244/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\245/) ) { $e_count++; 
&bad_add;} if ($tmp =~ (/\246/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\247/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\250/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\251/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\252/) ) { &bad_add;} if ($tmp =~ (/\253/) ) { &bad_add;} if ($tmp =~ (/\254/) ) { &bad_add;} if ($tmp =~ (/\255/) ) { &bad_add;} if ($tmp =~ (/\256/) ) { &bad_add;} if ($tmp =~ (/\257/) ) { &bad_add;} if ($tmp =~ (/\260/) ) { &bad_add;} if ($tmp =~ (/\261/) ) { &bad_add;} if ($tmp =~ (/\262/) ) { &bad_add;} if ($tmp =~ (/\263/) ) { &bad_add;} if ($tmp =~ (/\264/) ) { &bad_add;} if ($tmp =~ (/\265/) ) { &bad_add;} if ($tmp =~ (/\266/) ) { &bad_add;} if ($tmp =~ (/\267/) ) { &bad_add;} if ($tmp =~ (/\270/) ) { &bad_add;} if ($tmp =~ (/\271/) ) { &bad_add;} if ($tmp =~ (/\272/) ) { &bad_add;} if ($tmp =~ (/\273/) ) { &bad_add;} if ($tmp =~ (/\274/) ) { &bad_add;} if ($tmp =~ (/\275/) ) { &bad_add;} if ($tmp =~ (/\276/) ) { &bad_add;} if ($tmp =~ (/\277/) ) { &bad_add;} if ($tmp =~ (/\300/) ) { &bad_add;} if ($tmp =~ (/\301/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\302/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\303/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\304/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\305/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\306/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\307/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\310/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\311/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\312/) ) { &bad_add;} if ($tmp =~ (/\313/) ) { &bad_add;} if ($tmp =~ (/\314/) ) { &bad_add;} if ($tmp =~ (/\315/) ) { &bad_add;} if ($tmp =~ (/\316/) ) { &bad_add;} if ($tmp =~ (/\317/) ) { &bad_add;} if ($tmp =~ (/\320/) ) { &bad_add;} if ($tmp =~ (/\321/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\322/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\323/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\324/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\325/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\326/) ) { $e_count++; 
&bad_add;} if ($tmp =~ (/\327/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\330/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\331/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\332/) ) { &bad_add;} if ($tmp =~ (/\333/) ) { &bad_add;} if ($tmp =~ (/\334/) ) { &bad_add;} if ($tmp =~ (/\335/) ) { &bad_add;} if ($tmp =~ (/\336/) ) { &bad_add;} if ($tmp =~ (/\337/) ) { &bad_add;} if ($tmp =~ (/\340/) ) { &bad_add;} if ($tmp =~ (/\341/) ) { &bad_add;} if ($tmp =~ (/\342/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\343/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\344/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\345/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\346/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\347/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\350/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\351/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\352/) ) { &bad_add;} if ($tmp =~ (/\353/) ) { &bad_add;} if ($tmp =~ (/\354/) ) { &bad_add;} if ($tmp =~ (/\355/) ) { &bad_add;} if ($tmp =~ (/\356/) ) { &bad_add;} if ($tmp =~ (/\357/) ) { &bad_add;} if ($tmp =~ (/\360/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\361/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\362/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\363/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\364/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\365/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\366/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\367/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\370/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\371/) ) { $e_count++; &bad_add;} if ($tmp =~ (/\372/) ) { &bad_add;} if ($tmp =~ (/\373/) ) { &bad_add;} if ($tmp =~ (/\374/) ) { &bad_add;} if ($tmp =~ (/\375/) ) { &bad_add;} if ($tmp =~ (/\376/) ) { &bad_add;} if ($tmp =~ (/\377/) ) { &bad_add;} } # end of 177-377 } #end of looking for not a-z not A-Z not 0-9 not " _" if ($rec >= $max_bad) { if ($total) { $e_percent = ($e_count / $total) * 100; if ($e_percent >= 90) { $bail = 1; $e_bail = 1; } } } if ($one_bad) { $bad_count++; $one_bad = 0; } } # end 
of sub bad_chars ###################################################################### #sub help prints usage to screen. ###################################################################### sub help { print <>>>>>>>>>> records of length @>>>>>>>>>. First one is: @<<<<<<<<< $s_count, $s_length, $onerec[$s_length] . format ONE = There is @>>>>>>>>>>> record of length @>>>>>>>>>. Rec. number: @<<<<<<<<< $s_count, $s_length, $onerec[$s_length] . format STUFF = @>>>>> |@<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $rec, $line . format STUFF2 = @>>>>> |@<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $lastrec[$f], $lastline[$f] . format STUFF3 = @>>>>> |@<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< $rec, $LLine . #******************************************************************************* # Name : Usage # Purpose : Prints out usage message and bails # Arguments : 1 - String containing useful error message. # Return Value : None # Calls : Exit # Globals Accessed : # Notes : #******************************************************************************* sub Usage { local($pager); $, = ": "; ####################################################################### # insert check for which more which less if exist more, etc... # here... ####################################################################### $w_pager = `which more` ; $pager = ($ENV{'MANPAGER'} || $ENV{'PAGER'} || $w_pager || '/usr/bin/more'); if (@_) { print STDERR $progName, @_; print "\n"; print "Hit for manpage, interrupt to abort..."; $_ = ; } exec("nroff -man $0 | $pager"); } ############################################################### # These next few lines are legal in both Perl and nroff. 
.00;			# finish .ig

'di			\" finish diversion--previous line must be blank
.nr nl 0-1		\" fake up transition to first page again
.nr % 0			\" start at page 1
'; __END__ ##### From here on it's a standard manual page #####
.TH cio 1 "January 17, 1994"
.AT 3
.SH NAME
cio \- "Check It Out" -- examines datafiles and reports on contents.
.SH SYNOPSIS
.B cio [-cfhlmnqs]
.SH DESCRIPTION
.I cio
This /usr/bin/perl script checks a datafile and creates a report on the
contents of the datafile (see below).

.B Features:

- examines files with newline delimited records or files with all
records of the same size and no newline characters ("logical record
length" format).

- can read files from stdin or from command line. When reading from
command line it can accept wild card characters and multiple names and
examine multiple files.

- reports:
.RS 10
- number of records;

- record lengths (minimum, maximum and unique);

- files that have no newline characters and extremely long record
lengths;

- occurrences of lowercase and uppercase alphabetics;

- occurrences of tabs;

- occurrences of "bad" characters and the record number and byte value
of the character (in hex, octal and decimal);

- files that appear to be EBCDIC instead of ASCII.
.RE 10

- displays:
.RS 10
- first 70 characters of first and last 5 records;

- records with "bad" characters.
.RE 10

cio defaults:
.RS 10
- reads file names from command line;

- assumes records are delimited with newlines;

- displays the first and last 5 lines of the file;

- writes report to stdout;

- reports to stderr any files that cannot be opened or found;

- ceases processing after finding 5 records with "bad" characters or
5 records that appear to be in EBCDIC or if no newlines are
encountered in 32,768 bytes.
.RE 10

The above defaults can all be changed with command line flags.

Cio assumes datafile has newlines, unless -l flag is used (see below).

The terms "record" and "line" are used synonymously in this
description of cio.

Cio creates a report on the datafile it examines and writes this
report to stdout. The report includes: the minimum and maximum record
lengths in a datafile, the number of records of each length, and, for
records that have a unique length, the record number.

The report also prints part of the file for examination. By default the
report prints the first 70 columns of the first 5 records and the last
5 records. The report prints a "|" at the end of each record of 70 or
fewer characters and a ">" at the end of each record of length 71 or
longer. Additional lines may be printed by using the
.B -c
and
.B -n
flags (see below).
.SH OPTIONS
-c#
"count": print # data lines at beginning and end of report.

-f#
Forces cio to process # "bad" records before ceasing (default 5).

-h
Prints short help message.

-l
By default, cio assumes that a file being examined has a newline
character delimiting each record. For those files that do not have
newline characters but that do have fixed length records, the -l flag
followed by a number may be used to specify the "logical record
length." The number following the -l is the record length used by cio
to examine the file. A number must be specified.

-m
Prints this man page.

-n#
The report includes the first 70 characters of the first and last 5
records automatically. Additional records may be included by using the
-n option followed by a number. Cio will then print every nth record
in the file as specified by the number following the -n flag. A number
must be specified.

-q
Because large files may take a few minutes to process, you may
sometimes want to use the -q flag for the "quick" option. It produces
a report with the first 5 (or -c#) records of the file, plus the first
-n record (if specified) and the last 5 (or -c#) records of the file.
Its report of record lengths and bad characters is based only on its
examination of these records, not the entire file.
NOTE that the last records examined and displayed in the report are
based on a) reading from the end of the data file and b) reading the
number of bytes that cio anticipates to be about 5 (or the number
specified with the -c option) records from the end. If a file with
newline delimited records has very irregular record lengths, cio may
not pick up a full 5 records at the end, or it may pick up many more.
In a "lrecl" file, cio will display what *should* be the last few
records in the file if there are no missing or extra bytes in between
the first 5 records and the last few records; if the last few records
of an "lrecl" file appear incorrect it is probably caused by this
problem and the entire file should be examined.

-s
Forces cio to take input from stdin.
.SH EXAMPLES
cio -h

Prints brief help message.

cio -m

Prints this man page.

cio datafile

This reads "datafile" and sends report to stdout. The report includes
the first 70 characters of the first and last 5 records.

cio datafile > reportfile

As above, but writes report to "reportfile".

cat datafile | cio -s

Since cio reads from stdin, it can take the output of any pipe.

zcat compressed_datafile.Z | cio -s

As above, but reads data in "compressed_datafile.Z".

zcat compressed_datafile.Z | cio -s | more

Naturally, reports can be piped through a pager such as "more". If you
expect bad characters in the report, you might want to use the pager
"less" as it is generally better at handling them than "more."

cio -n 500 datafile

This reads "datafile" and sends report to stdout. The report includes
the first 70 characters of the first and last 5 records and every
500th record (e.g., records 500, 1000, 1500, etc.) in the file.
Flags are given before the file name because option parsing stops at
the first argument that is not a flag.

cio -n 500 -c 20 datafile

This is just like the above example, except that the first and last 20
records are included in the report. (Note that -n and -c need not be
used together.)

cio -q datafile

This reads "datafile" in the "quick" mode.
The report is based on an examination of only the first and last 5
records of "datafile." The first 70 characters of records 1-5 are
printed in the report; cio reads from the end of the file and
estimates where the last 5 records will be and prints what it finds in
the report.

cio -q -n 500 datafile

This reads "datafile" in the "quick" mode. The report is based on an
examination of only the first 5 records, record 500 and the last 5
records of "datafile." The first 70 characters of records 1-5 and
record 500 and the last 5 records are printed in the report.

cio -l 80 -n 500 datafile

"datafile" is assumed to be a datafile with no newline characters and
a fixed record length ("logical record length") of 80 characters. The
report includes the first 70 characters of the first 5 records and
every 500th record in the file. It also prints the last 5 records.
.SH ENVIRONMENT
No environment variables are required, but cio will use MANPAGER or
PAGER, if set, to display this man page.
.SH FILES
None.
.SH AUTHOR
Jim Jacobs
.SH "SEE ALSO"
perl
.SH DIAGNOSTICS
.SH BUGS
None known. Please report any to jajacobs@ucsd.edu
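.SH NOTES
The record-length part of the report (minimum and maximum lengths, the
number of records of each length, and the record number of the first
record of each length) can be sketched roughly as follows. This is an
illustrative sketch only, not cio's own code: it assumes a perl recent
enough to provide "say" (5.10 or later), and the helper name
tally_lengths and the sample records are hypothetical.
.nf
```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper, not part of cio: tally the lengths of records
# whose trailing newlines have already been removed.
sub tally_lengths {
    my @records = @_;
    my (%count, %first, $min, $max);
    my $recno = 0;
    for my $record (@records) {
        $recno++;
        my $len = length $record;
        $min = $len if !defined $min || $len < $min;
        $max = $len if !defined $max || $len > $max;
        $count{$len}++;
        # Remember the record number of the first record of this length.
        $first{$len} = $recno unless exists $first{$len};
    }
    return { min => $min, max => $max, count => {%count}, first => {%first} };
}

# Made-up sample data: three 4-byte records and one 2-byte record.
my $t = tally_lengths('AAAA', 'BBBB', 'CC', 'DDDD');
say "Max rec len: $t->{max}";
say "Min rec len: $t->{min}";

# Most frequent record length first.
my @by_count = sort { $t->{count}{$b} <=> $t->{count}{$a} } keys %{ $t->{count} };
for my $len (@by_count) {
    my $n = $t->{count}{$len};
    if ($n == 1) {
        say "There is 1 record of length $len. Rec. number: $t->{first}{$len}";
    } else {
        say "There are $n records of length $len. First one is: $t->{first}{$len}";
    }
}
```
.fi
Sorting by count in descending order mirrors the report, which lists
the most frequent record length first; a length that occurs only once
is reported with the number of the single record that has it.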