/**
 * This method updates the compressed stream position exactly when the
 * client of this code has read at least one byte past any BZip2
 * end-of-block marker.
 *
 * This mechanism is very helpful for dealing with data-level record
 * boundaries. Please see the constructor and next methods of
 * org.apache.hadoop.mapred.LineRecordReader for an example usage of this
 * feature. We elaborate on it with an example in the following:
 *
 * Assume two different scenarios of the BZip2 compressed stream, where
 * [m] represents an end of block, \n is the line delimiter and . represents
 * compressed data.
 *
 * ............[m]......\n.......
 *
 * ..........\n[m]......\n.......
 *
 * Assume that end is right after [m]. In the first case the reading
 * will stop at \n and there is no need to read one more line. (The
 * reason for reading one more line in the next() method is explained in
 * LineRecordReader.) In the second case, LineRecordReader needs to read
 * one more line (up to the second \n). Since BZip2Codec only updates the
 * position once at least one byte past a marker has been read, it is
 * straightforward to differentiate between the two cases.
 */
public int read(byte[] b, int off, int len) throws IOException {
  if (needsReset) {
    internalReset();
  }

  int result = this.input.read(b, off, len);
  if (result == BZip2Constants.END_OF_BLOCK) {
    // An end-of-block marker was hit; advertise the position on the next read.
    this.posSM = POS_ADVERTISEMENT_STATE_MACHINE.ADVERTISE;
  }

  if (this.posSM == POS_ADVERTISEMENT_STATE_MACHINE.ADVERTISE) {
    // Read one byte past the marker (the third argument is a length,
    // not an end offset).
    result = this.input.read(b, off, 1);
    // This is the precise time to advertise the updated compressed stream
    // position to the client of this code.
    this.updatePos(true);
    this.posSM = POS_ADVERTISEMENT_STATE_MACHINE.HOLD;
  }

  return result;
}
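
// The two scenarios described in the Javadoc above can be illustrated with a
// small, self-contained toy model. This is only a sketch under stated
// assumptions, not the Hadoop implementation: PositionAdvertisementSketch,
// ToyStream and readLines are hypothetical names, '|' stands in for the [m]
// marker, string indices stand in for compressed offsets, and `end` is taken
// to be the index of the marker so that the advertised position exceeds `end`
// only after one character past the marker has been consumed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of delayed position advertisement; not Hadoop code. */
public class PositionAdvertisementSketch {

  /** Toy stream in which '|' plays the role of the end-of-block marker [m]. */
  static final class ToyStream {
    private final String data;
    private int cursor = 0;            // index of the next character to consume
    private int advertisedPos = 0;     // position visible to the client
    private boolean advertise = false; // armed once a marker has been crossed

    ToyStream(String data) {
      this.data = data;
    }

    /** Returns the next data character, or -1 at end of stream. */
    int read() {
      // Crossing a marker only arms the state machine; nothing is published yet.
      while (cursor < data.length() && data.charAt(cursor) == '|') {
        cursor++;
        advertise = true;
      }
      if (cursor >= data.length()) {
        return -1;
      }
      char c = data.charAt(cursor++);
      if (advertise) {
        // One character past the marker has been read: publish the position
        // just past the marker and return to the holding state.
        advertisedPos = cursor - 1;
        advertise = false;
      }
      return c;
    }

    int getPos() {
      return advertisedPos;
    }
  }

  /** Reads lines while the advertised position has not moved past end. */
  static List<String> readLines(String data, int end) {
    ToyStream in = new ToyStream(data);
    List<String> lines = new ArrayList<>();
    while (in.getPos() <= end) {
      StringBuilder line = new StringBuilder();
      int c;
      while ((c = in.read()) != -1 && c != '\n') {
        line.append((char) c);
      }
      if (c == -1 && line.length() == 0) {
        break; // nothing left to read
      }
      lines.add(line.toString());
    }
    return lines;
  }

  public static void main(String[] args) {
    // Case 1: the marker falls in the middle of a line (marker at index 4).
    System.out.println(readLines("aaaa|bb\ncc\ndd", 4));
    // Case 2: the marker directly follows a line delimiter (marker at index 5).
    System.out.println(readLines("aaaa\n|bb\ncc\ndd", 5));
  }
}
```

// In this toy model the first call returns a single line, because the position
// is published while that line is still being read. The second call returns
// two lines, because the position stays unchanged after the first \n and only
// moves past `end` once a character beyond the marker has been consumed.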