Reading large ASCII files

PRODUCT/PLATFORM INFORMATION

Product/Version #: PV-Wave / 7.01
Architecture/OS version: Any / All

PROBLEM DESCRIPTION

When attempting to read large ASCII files, one may run into the limits of both DC_READ_FREE and READF. For example, DC_READ_FREE will only read 255 values in a row, while the buffer for reading a string with READF is 2048 characters.

SOLUTION

Date Received: 03/30/2001 EJS
Date Updated: 05JUN2001 gdr
   Added a version, READ2BIG, which handles repeated delimiters, such as the
   double spaces that may occur in fixed-format numeric data. READ2BIG is
   simply appended rather than replacing READBIG, because READBIG has the
   useful property of inserting null fields when doubled delimiters (",,")
   are encountered.

Use the function below. It has been adapted from an older tip, made more general, and can be called as a function. CRs have been filed to increase the above limits in DC_READ_FREE and READF, but until then feel free to use or adapt the code below.

; (c) Copyright Visual Numerics Inc, 2001 - Boulder Colorado USA
; $Id: readbig.pro, v 0.5 2001/05/08 11:52:00 estewart $
;
;+
; NAME: ReadBig
; PURPOSE:
;   Read in files that DC_READ_FREE and READF cannot (Version 7.01)
;   DC_READ_FREE can only get 255 values in a row
;      a = INTARR(300,5)
;      status = DC_READ_FREE(filename, a, /row)
;      -> % DC_READ_FREE: Input record is too long.
;   READF can only read 2048 characters
;      OPENR, 1, filename & in = ''
;      READF, 1, in
;      -> % Input line is too long for input buffer of 2048 characters.
;
; CALLING SEQUENCE:
;   result = ReadBig('filename' [, delim=delimarray, /NoLF,
;                    /Byte.../Double])
;
; INPUTS: filename = the name of the file to read, a string
;
; KEYWORDS:
;   delim = a string of delimiters to use (by default, comma, space, tab
;           and LF are all included)
;   /nolf = do not use the LF (linefeed) as a delimiter
;   /byte, /integer, /long, /float, /double = set type of returned array
;           (default = /string).
;           The most complex type takes precedence if more
;           than one is set
;
; OUTPUTS:
;   result = a vector containing all elements between delimiters in the
;            file, of the requested type. If the read fails, result = -1
;
; PROCEDURE:
;   An ASCII file with specific delimiters is read in as a binary BYTE
;   array. This array is parsed and data values between delimiters are
;   extracted and converted to a STRING vector, which is typed if
;   requested, then returned.
;
; MODIFICATION HISTORY:
;   Origin: estewart VNI 2/01
;-

FUNCTION ReadBig, filename, Delim=delim, NoLF=nolf, string=string, $
                  byte=byte, integer=integer, long=long, float=float, double=double

ON_ERROR, 2

; Select the output type; later tests override earlier ones, so the
; most complex requested type wins.
type = 0
IF N_Elements(byte) NE 0 THEN type = 1
IF N_Elements(integer) NE 0 THEN type = 2
IF N_Elements(long) NE 0 THEN type = 3
IF N_Elements(float) NE 0 THEN type = 4
IF N_Elements(double) NE 0 THEN type = 5

IF N_Elements(nolf) EQ 0 THEN lf = 1 ELSE lf = 0   ; LF is a delimiter?

IF N_Elements(delim) EQ 0 THEN $
   delims = [09B, 32B, 44B] $                      ; [TAB, Space, ","]
ELSE delims = BYTE(delim)
IF lf NE 0 THEN delims = [delims,10B]              ; add LF as a delimiter
numdelims = N_Elements(delims)                     ; number of delimiters

ON_IOERROR, ioerr
OPENR, unit, filename, /Get_Lun
filestat = FSTAT(unit)                             ; get file information
filesize = filestat.size                           ; specifically, the size
inbyte = BYTARR(filesize, /NoZero)                 ; array to hold the file
READU, unit, inbyte
CLOSE, unit & FREE_LUN, unit

; Collect the byte offsets of every delimiter in the file.
FOR i = 0L,numdelims-1 DO BEGIN
   loctemp = WHERE(inbyte EQ delims(i),count)      ; where delim(i) exists
   IF count NE 0 THEN BEGIN
      IF SIZE(locdelims, /Type) EQ 0 THEN $        ; initialize if undefined
         locdelims = loctemp $
      ELSE locdelims = [locdelims,loctemp]         ; concatenate if defined
   ENDIF
ENDFOR
locdelims = locdelims(SORT(locdelims))

numvals = N_ELEMENTS(locdelims)                    ; Number of values in file
data = STRARR(numvals)                             ; array to hold the data
first = 0L                                         ; reference the counter
IF lf EQ 0 THEN last = numvals - 2L $
ELSE last = numvals - 1L
FOR i = 0L, last DO BEGIN                          ; Parse the byte array
   data(i) = STRING(inbyte(first:locdelims(i)))
   first = locdelims(i) + 1
ENDFOR
IF lf EQ 0 THEN $                                  ; get final value if no LF
   data(numvals-1) = STRING(inbyte(locdelims(numvals-2):*))
inbyte = 0B                                        ; Deallocate the byte array

CASE type OF
   1: data = BYTE(FLOAT(data))
   2: data = FIX(FLOAT(data))                      ; in case of exponential notation
   3: data = LONG(FLOAT(data))                     ; in case of exponential notation
   4: data = FLOAT(data)
   5: data = DOUBLE(data)
   ELSE:
ENDCASE
RETURN, data

ioerr:
MESSAGE, 'Error reading file: ' + filename + ', ' + !Err_string, /Continue
RETURN, -1
END

ABOVE: ReadBig - handles consecutive delimiters with nulls (0.1,,5)
BELOW: Read2Big - ignores repeated delimiters (double spaces)
_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_

; (c) Copyright Visual Numerics Inc, 2001 - Boulder Colorado USA
; $Id: readbig.pro, v 0.5 2001/05/08 11:52:00 estewart $
;
;+
; NAME: Read2Big
; PURPOSE:
;   Read in files that DC_READ_FREE and READF cannot (Version 7.01)
;   DC_READ_FREE can only get 255 values in a row
;      a = INTARR(300,5)
;      status = DC_READ_FREE(filename, a, /row)
;      -> % DC_READ_FREE: Input record is too long.
;   READF can only read 2048 characters
;      OPENR, 1, filename & in = ''
;      READF, 1, in
;      -> % Input line is too long for input buffer of 2048 characters.
;
; CALLING SEQUENCE:
;   result = Read2Big('filename' [, delim=delimarray, /Byte.../Double])
;
; INPUTS: filename = the name of the file to read, a string
;
; KEYWORDS:
;   delim = a string of delimiters to use (by default, comma, space, tab, CR
;           and LF are all included)
;   /byte, /integer, /long, /float, /double = set type of returned array
;           (default = /string). The most complex type takes precedence if more
;           than one is set
;
; OUTPUTS:
;   result = a vector containing all elements between delimiters in the
;            file, of the requested type. If the read fails, result = -1
;
; PROCEDURE:
;   An ASCII file with specific delimiters is read in as a binary BYTE
;   array.
;   This array is parsed and data values between delimiters are
;   extracted and converted to a STRING vector, which is typed if
;   requested, then returned.
;   N.B.: We treat all delimiters equally and DO NOT infer missing data on
;   the occurrence of, say, " 45,,67 ". Handling of missing data could
;   easily be added by including a vector of MD values,
;   but it isn't there now.
;   Also note that "123.34 " will be read properly, but ", 10,123.34, "
;   will be read as two values, 10 and 123.34, not as ten thousand.
;
; MODIFICATION HISTORY:
;   Origin: estewart VNI 2/01
;   Modify: grodd VNI 5/01
;-

FUNCTION Read2Big, filename, Delim=delim, string=string, $
                   byte=byte, integer=integer, long=long, float=float, double=double

ON_ERROR, 2

; Select the output type; later tests override earlier ones, so the
; most complex requested type wins.
type = 0
IF N_Elements(byte) NE 0 THEN type = 1
IF N_Elements(integer) NE 0 THEN type = 2
IF N_Elements(long) NE 0 THEN type = 3
IF N_Elements(float) NE 0 THEN type = 4
IF N_Elements(double) NE 0 THEN type = 5

IF N_Elements(delim) EQ 0 THEN $
   delims = [09B, 32B, 44B, 13B, 10B] $            ; [TAB, Space, ",", CR, LF]
ELSE delims = BYTE(delim)
numdelims = N_Elements(delims)                     ; number of delimiters
; print, "List of Delimiters"
; pm, delims

ON_IOERROR, ioerr
OPENR, unit, filename, /Get_Lun
filestat = FSTAT(unit)                             ; get file information
filesize = filestat.size                           ; specifically, the size
inbyte = BYTARR(filesize, /NoZero)                 ; array to hold the file
READU, unit, inbyte
CLOSE, unit & FREE_LUN, unit
;
; We make things simpler by sticking a delimiter at the very beginning
; of the data. This helps us pick up the first datum ( = data element).
inbyte = [delims(0), inbyte]
; Print, "Byte values for checking"
; print, inbyte
;
xeof = N_Elements(inbyte)
; print, "There are " + string(xeof) + " bytes to mush."
locdelims = wherein( inbyte, delims)
locdelims = locdelims(SORT(locdelims))
; We know where the delimiters are now, and if we subtract, we have an
; array containing the distance from one delimiter to the next.
D = [locdelims(1:*), xeof] - locdelims
; info, D
; Print, "Sorted indices of delimiters"
; print, locdelims
; Print, "Subset of delimiters"
; print, inbyte(locdelims)
;
; Where the distance between delimiters is 1, we have adjacent
; delimiters; for this routine we will assume that we
; want to handle data in which delimiters can come next to
; one another, but that this does not indicate missing data or null values.
; print, "See ones where there are repeated delimiters?", D
; Print, "Now, Which locations mark the beginning of NON-delimiters?"
D0 = locdelims( where( D GT 1))                    ; delimiters followed by data
D1 = D0 + 1                                        ; first byte of each datum
D2 = D0 + (D( where( D GT 1)) - 1 )                ; Since D gives the lengths, D-1 is what we need to add
NVars = N_elements(D1)
;
; Debugging aids (uncomment to inspect the variables):
; print, "Where are the non-delims?"
; print, "****************************"
; info, /var
; pm, [[d1], [d2]]
; print, "****************************"
;
data = STRARR(NVars)                               ; array to hold the data
FOR i = 0L, NVars - 1 DO BEGIN                     ; Parse the byte array
   data(i) = STRING(inbyte(D1(i):D2(i)))           ; the ":" syntax requires scalars
ENDFOR
inbyte = 0B                                        ; Deallocate the byte array

CASE type OF
   1: data = BYTE(FLOAT(data))
   2: data = FIX(FLOAT(data))                      ; in case of exponential notation
   3: data = LONG(FLOAT(data))                     ; in case of exponential notation
   4: data = FLOAT(data)
   5: data = DOUBLE(data)
   ELSE:
ENDCASE
RETURN, data

ioerr:
MESSAGE, 'Error reading file: ' + filename + ', ' + !Err_string, /Continue
RETURN, -1
END
_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_
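
EXAMPLE USAGE

Both routines return a flat vector, so a caller usually has to check for the -1 failure value and then reshape the result. The following is only a minimal sketch of that pattern, not part of the original tip. The procedure name ReadBigDemo, the file name 'bigdata.txt', and the row width of 300 values are assumptions made for illustration; the row width is borrowed from the INTARR(300,5) example in the routine headers.

```pvwave
PRO ReadBigDemo
   ; Hypothetical file name, used only for illustration.
   filename = 'bigdata.txt'

   ; Read every field as double precision, ignoring repeated delimiters.
   vals = Read2Big(filename, /Double)

   ; On an I/O error the routine returns the scalar -1, so test the
   ; number of dimensions (first element of SIZE) before using the result.
   dims = SIZE(vals)
   IF dims(0) EQ 0 THEN BEGIN
      PRINT, 'Could not read ', filename
      RETURN
   ENDIF

   ; Assumption: the file holds 300 values per row.
   ncols = 300L
   nrows = N_ELEMENTS(vals) / ncols
   IF nrows EQ 0 THEN BEGIN
      PRINT, 'Fewer than one full row of values was read'
      RETURN
   ENDIF

   ; Drop any incomplete trailing row, then reshape to (ncols, nrows).
   a = REFORM(vals(0:ncols*nrows-1), ncols, nrows)
   INFO, a
END
```

Testing the number of dimensions rather than comparing against -1 avoids mistaking a file whose first value happens to be -1 for a failed read. Use ReadBig in place of Read2Big when doubled delimiters such as ",," should produce null fields instead of being skipped.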