All Java classes for these notes are found in the package java.io.

Streams

A stream is a series of bytes either:

·       Written by the program to memory, an output device, or a file on the disk or

·       Read by the program from one of these sources

 

Files

File objects

The File class and its subclasses provide methods that facilitate manipulating data in a file on the disk.

The disk is a place where data can be written at some point in time by program1 and read by program2 at an arbitrary time in the future.

 

File methods

Objects instantiated from these File classes allow the programmer to locate and access files on the disk.  When a program creates a File object, it must provide a String that is the name of the disk file this File object will manipulate.  They provide the following general operations:

exists

Return whether or not there is a file on disk having the name specified when the File object was created.

canRead, canWrite

Disk files are protected by the operating system such that only certain users can read or write them.  These methods tell a program whether it is able to read or write the underlying disk file.

<others>

There are many operations a program can perform on a disk file using a File object.  See the File class API.

open

Reserve the underlying disk file for either:

·       Reading – in general, multiple different File objects can open the same disk file for reading.

·       Writing – in general, only one File object can open a disk file for writing.  A File object attempting to open a disk file for writing that has already been opened by another File object will get an exception.

So, when a File object ‘opens’ a disk file, this object now has the ability to read or write the contents of this disk file depending on the kind of open the File object did.

 

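In Java, creating an object of one of the file stream classes (for example FileInputStream, FileOutputStream, or RandomAccessFile) generally performs the open operation.  These classes are not literal subclasses of java.io.File; in these notes, ‘File subclasses’ means the classes that read and write disk files.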

close

Release the reservation created when the File object executed its open method.

read

Read some number of bytes from the underlying disk file into memory.

write

Write some number of bytes from memory to the underlying disk file.

seek

File objects generally keep a pointer (sometimes called a cursor) that holds the position of the next byte in the file to read or write.  Some File subclasses allow you to move this pointer without reading or writing any data.
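As a minimal sketch of these operations in Java (the class name and the file name notes-demo.bin are made up for illustration): a java.io.File object answers questions about the path, while the open, read, write, and close steps are carried out by stream classes such as FileOutputStream and FileInputStream.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class FileOpsSketch {
    // Writes three bytes to the file, reads them back, and returns their sum.
    static int writeThenRead(File f) throws IOException {
        // Constructing the stream performs the "open";
        // leaving the try-with-resources block performs the "close".
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[] {10, 20, 30});
        }
        int sum = 0;
        try (FileInputStream in = new FileInputStream(f)) {
            int b;
            while ((b = in.read()) != -1) {   // read() returns -1 at end of file
                sum += b;
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        File f = new File("notes-demo.bin");  // hypothetical file name
        System.out.println("exists before: " + f.exists());
        int sum = writeThenRead(f);
        System.out.println("exists after: " + f.exists()
                + ", canRead: " + f.canRead()
                + ", canWrite: " + f.canWrite());
        System.out.println("sum of bytes: " + sum);
        f.delete();                           // clean up the demo file
    }
}
```

The try-with-resources form is a design choice for the sketch: it closes the stream even when a read or write throws.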

 

Traversal

Most programming languages that support file reading and writing provide two ways to traverse the data in a file:

·       Sequential – each data item must be read (or written) immediately after the last data item read (or written).  Opening a file for sequential access requires opening it for read or for write, but not for both.

·       Random – at any point in time, the programmer may read or write the next data item or use the seek operation to position the file pointer to an arbitrary data item in the file.

 

Data Item Size and Type

Java supports interpreting the data items in a file as follows:

·       As numbers – byte, short, int, long, float, double values – all signed values

·       As booleans – each boolean takes a whole byte – either a 0 (false) or a 1 (true)

·       As chars – chars in Java are two-byte Unicode characters, but when they are written to disk the amount of space they occupy in the file depends on the encoding.  See the ‘Encodings’ section below.
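The sizes above can be checked from Java itself. This is a small sketch (the class and method names are illustrative): the wrapper classes expose a BYTES constant for each numeric type, and writing the two boolean values through a DataOutputStream into an in-memory buffer shows the one-byte 0/1 representation.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class ItemSizeSketch {
    // Returns the bytes DataOutputStream produces for false followed by true.
    static byte[] booleanBytes() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeBoolean(false);
            out.writeBoolean(true);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Size in bytes of each primitive type, as defined by the language.
        System.out.println("byte=" + Byte.BYTES + " short=" + Short.BYTES
                + " int=" + Integer.BYTES + " long=" + Long.BYTES
                + " float=" + Float.BYTES + " double=" + Double.BYTES
                + " char=" + Character.BYTES);
        // Each boolean occupies a whole byte: 0 for false, 1 for true.
        System.out.println("booleans: " + Arrays.toString(booleanBytes()));
    }
}
```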

File Subclasses

‘Binary’ data files

Sequential access to the data in a file. 

Write - DataOutputStream wraps FileOutputStream

Read – DataInputStream wraps FileInputStream

Methods

writeBoolean, // write 1 byte with value either 0 false or 1 true

writeChar,     // write 2 bytes in UTF-16 format

writeByte,      // write only 1 byte – the low byte of the value passed in

writeShort, writeInt, writeLong, writeFloat, writeDouble, // write the appropriate number of bytes in the appropriate format

writeUTF(String) // write a 2-byte length followed by the string in modified UTF-8 encoding

 

and corresponding read methods (readBoolean, readChar, readByte, readShort, readInt, readLong, readFloat, readDouble, readUTF)

 

see demos/BinaryDataWriteDemo.java
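The following is a sketch of a sequential write followed by a sequential read. It is not the course demo; the class name and the file name roundtrip-demo.bin are invented for illustration. The values must be read back in the same order and with the same types as they were written.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class BinaryRoundTripSketch {
    static void write(File f) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeBoolean(true);   // 1 byte
            out.writeChar('J');       // 2 bytes
            out.writeInt(258);        // 4 bytes: 00 00 01 02
            out.writeDouble(2.5);     // 8 bytes
            out.writeUTF("hi");       // 2-byte length (00 02) + 2 bytes of text
        }
    }

    // Reads the items back in the order they were written.
    static String read(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            boolean b = in.readBoolean();
            char c = in.readChar();
            int i = in.readInt();
            double d = in.readDouble();
            String s = in.readUTF();
            return b + " " + c + " " + i + " " + d + " " + s;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = new File("roundtrip-demo.bin");  // hypothetical file name
        write(f);
        System.out.println(read(f));
        System.out.println("file length: " + f.length() + " bytes");
        f.delete();
    }
}
```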

 

Random data files

RandomAccessFile raf = new RandomAccessFile("path", "access");

Acts like both a stream and a file, i.e., doesn’t need an additional stream class.

 

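“access” notes:

-       Can be “r”, “rw”, “rws”, or “rwd”.  There is no write-only “w” mode; passing it throws an IllegalArgumentException.

-       If “r”, the file must pre-exist.

-       If “rw” (or “rws” / “rwd”) and the file doesn’t exist, it will be created.

-       “rwd” writes with immediate updates to storage.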

 

seek(long pos) moves the file pointer to an absolute position, measured in bytes from the beginning of the file.

 

See demos/RAFDemo.java
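Here is a rough sketch of random access (again, not the course demo; the class name and the file name raf-demo.bin are invented). It writes five ints sequentially, then uses seek() to overwrite one of them in place. Because every int occupies 4 bytes, item number index starts at byte offset index * 4.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class RafSketch {
    static final int COUNT = 5;

    // Writes the ints 0, 10, 20, 30, 40, then overwrites the one at the given index.
    static void writeAndPatch(File f, int index, int newValue) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(0);                        // discard any old contents
            for (int i = 0; i < COUNT; i++) {
                raf.writeInt(i * 10);                // each write advances the file pointer
            }
            raf.seek((long) index * Integer.BYTES);  // absolute offset from start of file
            raf.writeInt(newValue);
        }
    }

    // Reads the int at the given index without reading the items before it.
    static int readAt(File f, int index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek((long) index * Integer.BYTES);
            return raf.readInt();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = new File("raf-demo.bin");           // hypothetical file name
        writeAndPatch(f, 3, 999);
        for (int i = 0; i < COUNT; i++) {
            System.out.println("item " + i + " = " + readAt(f, i));
        }
        f.delete();
    }
}
```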

 

Text Files

PrintWriter - characters are converted into bytes according to the platform's default character encoding

BufferedReader wraps FileReader

-       Allows:

o   Read an entire line.  The readLine() method returns null if at the end of file, but it consumes input if not at end of file. 

o   Limited moving of the file pointer (mark, skip, reset).  The mark() method throws an IOException if the stream is closed, and reset() throws one if the stream has never been marked or the mark has been invalidated.

 

-       From the API:

In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders. 
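A small sketch of writing and reading a text file with these classes follows. The class name and the file name text-demo.txt are illustrative, and the sample lines are plain ASCII so the platform default encoding does not affect the result.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

class TextFileSketch {
    static void writeLines(File f, String... lines) throws IOException {
        try (PrintWriter out = new PrintWriter(f)) {  // uses the default character encoding
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    static List<String> readLines(File f) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = in.readLine()) != null) {  // null signals end of file
                result.add(line);
            }
        }
        return result;
    }

    // Reads the first line, rewinds with reset(), and reads it again.
    static String firstLineTwice(File f) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(f))) {
            in.mark(1024);   // up to 1024 chars may be read before reset() is called
            String first = in.readLine();
            in.reset();      // back to the marked position
            String again = in.readLine();
            return first + "|" + again;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = new File("text-demo.txt");           // hypothetical file name
        writeLines(f, "alpha", "beta");
        System.out.println(readLines(f));
        System.out.println(firstLineTwice(f));
        f.delete();
    }
}
```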

 

Exceptions

Most of the classes in package java.io have methods with ‘throws’ clauses in their signatures, typically for IOException or one of its subclasses.

When you call these methods, you must either provide a handler for these exceptions or put a ‘throws’ clause in your own method signature.
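The two options look like this in a short sketch (class and method names are made up for illustration). The first method passes the exception on to its caller; the second handles it on the spot.

```java
import java.io.FileInputStream;
import java.io.IOException;

class ExceptionSketch {
    // Option 1: declare the exception and let the caller deal with it.
    static int firstByte(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            return in.read();
        }
    }

    // Option 2: handle the exception here.
    // Returns -1 on failure (note that read() also returns -1 for an empty file).
    static int firstByteOrMinusOne(String path) {
        try {
            return firstByte(path);
        } catch (IOException e) {
            System.err.println("Could not read " + path + ": " + e.getMessage());
            return -1;
        }
    }

    public static void main(String[] args) {
        // Hypothetical path that is not expected to exist.
        System.out.println(firstByteOrMinusOne("no-such-dir-xyz/no-such-file.bin"));
    }
}
```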

Hexdump

A tool that allows you to inspect individual bytes in a file.

It is a command-line program.  It is already available on Linux and Mac; Windows users need to install it first (see ‘Installation options’ below).

 

hexdump -C <filename>

 

displays the contents of a file in the way we have been doing in class.
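If the tool is not at hand, a rough stand-in can be written in a few lines of Java. This sketch (class and method names are illustrative) prints an offset followed by up to 16 byte values in hex per line. It is a simplified layout and leaves out the ASCII column that hexdump -C adds on the right.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class HexSketch {
    // Formats bytes as lines of "offset  hh hh ...", 16 bytes per line.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i % 16 == 0) {
                if (i > 0) {
                    sb.append('\n');
                }
                sb.append(String.format("%08x ", i));      // offset of the first byte on the line
            }
            // Mask with 0xff so negative bytes print as two hex digits.
            sb.append(String.format(" %02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HexSketch.java <filename>");
            return;
        }
        System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```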

 

Running in PCW:

Start | cmd

PATH="X:CS258 (6448-Spring2019)\CourseFiles";%PATH%

P:

cd cs258\ws

The above cd command assumes your workspace is on your P drive in the directory cs258\ws

 

Installation options

Mac users can start a terminal window and type in the command.

 

Same for linux users.

 

If you are on Windows, you can download a stand-alone version of this tool from https://www.di-mgt.com.au/hexdump-for-windows.html (scroll to the bottom of the page).

Extract the zip file to C:

Copy C:\hexdump-2.0.0\hexdump.exe to C: 

(You will likely need administrator permission to do this.  If you are on a system without administrator permission, unzip/copy hexdump.exe to any directory where you do have permission, e.g., P:\)

 

Windows 10 users can also install the windows bash shell by following the instructions here https://www.laptopmag.com/articles/use-bash-shell-windows-10

 

Running Hexdump

1.    Start a terminal window (cmd window if you are running the stand alone Windows version, terminal on a mac)

2.    Change directory to the directory where your file is.   This step can be tricky on the windows bash shell and on macs.

3.    Run the hexdump utility as described above.

Example

Here is a run on my windows system using the stand-alone hexdump.exe:

1.    Open a command line: Start | cmd (open a terminal window on Mac or Linux)

2.    cd <eclipseProjectDirectory>

3.    On Windows with hexdump.exe copied to C:\, run:

C:\hexdump -C iodemotest.bin

On Mac or Linux, run:

hexdump -C iodemotest.bin

 

Here is a picture of what I did after start | cmd:

 

 

Yellow text is what I typed on the command line.  (I was a little sloppy with my highlight …)

Red underline is the path to the Demos project directory in my eclipse workspace.

Replace this path with yours.

 

The file iodemotest.bin was written by running demos/BinaryDataWriteDemo.java.

Character sets

Excellent and humorous introduction to the topic.  Please read.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

 

ASCII

We know what this one is: 7 bits per character, with values 00 – 7F (0 – 127 decimal).  Each character is generally stored in 1 byte.

 

Unicode

A Java char is a two-byte (16-bit) value, which is enough to represent every code point in Unicode Plane 0 (the Basic Multilingual Plane) – see below.

https://en.wikipedia.org/wiki/Code_point

In character encoding terminology, a code point or code position is any of the numerical values that make up the code space.[1] Many code points represent single characters but they can also have other meanings, such as for formatting.

For example, the character encoding scheme ASCII comprises 128 code points in the range 0hex to 7Fhex, Extended ASCII comprises 256 code points in the range 0hex to FFhex, and Unicode comprises 1,114,112 code points in the range 0hex to 10FFFFhex. The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.

https://en.wikipedia.org/wiki/Unicode#Code_point_planes_and_blocks

https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane
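To see how code points relate to Java chars, here is a short sketch (the class and method names are illustrative; U+1F600 is just one example of a code point outside Plane 0). A code point in the Basic Multilingual Plane fits in one char, while a supplementary code point is stored as two chars, called a surrogate pair.

```java
class CodePointSketch {
    // Builds a String holding exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String euro = fromCodePoint(0x20AC);   // in Plane 0
        String face = fromCodePoint(0x1F600);  // in a supplementary plane

        System.out.println("U+20AC  chars: " + euro.length());
        System.out.println("U+1F600 chars: " + face.length());
        System.out.println("U+1F600 code points: "
                + face.codePointCount(0, face.length()));
        // The two chars of the pair, shown as hex values.
        System.out.println("surrogates: "
                + Integer.toHexString(face.charAt(0)) + " "
                + Integer.toHexString(face.charAt(1)));
    }
}
```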

Encodings

UTF-8 -- https://en.wikipedia.org/wiki/UTF-8

Used by over 90% of the existing websites.

Note that in a 4-byte character there are 21 x’s (payload bits) in the byte layout table on that page: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
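The variable length of UTF-8 can be observed directly. In this sketch (class and method names are illustrative, and the four code points are arbitrary samples), each code point is encoded with String.getBytes and the number of resulting bytes is reported.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {
    // Number of bytes UTF-8 uses for a single code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x250, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " -> " + utf8Length(cp) + " byte(s)");
        }
    }
}
```

The samples cover the 1-, 2-, 3-, and 4-byte forms in that order.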

 

Set Eclipse to use UTF-8 encoding

Window | Preferences | General | Workspace

https://en.wikipedia.org/wiki/Windows-1252#Character_set

 

Still unknown why we don’t get 128-159 shown in the above link.  They are not ISO 8859-1, but I thought they should display if using eclipse default Cp1252.

 

Note: Java chars are two bytes long.  Two bytes can represent only the 65,536 code points of the Basic Multilingual Plane.  Java represents code points in the supplementary planes as surrogate pairs, i.e., two consecutive chars, and methods such as String.codePointAt and Character.toChars work with whole code points.

 

 

See demos/CharsetEncodingTest.java

 

Here is an example of encoding large Unicode values with writeUTF, which uses (modified) UTF-8 encoding:

        dao.writeUTF(""+ 

// first 2 bytes show len of string 00 08  -- 8 byte long string

                   '\u20ac'+

// 3 bytes: b1: {1110} [0010], b2: {10}[00 00][10], b3: {10}[10] [1100]

//                E      2            8       2             A      C

                   '\uFFFD'+

// 3 bytes: b1: {1110} [1111], b2: {10}[11 11][11], b3: {10}[11] [1101]

//                E      F            B       F             B      D

                   '\u0250');

// 2 bytes: b1: {110} [0 10  01], b2: {10}[01 0000] 

//                  C      9             9     0
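The byte values in the comments above can be checked with a runnable version of the fragment. This is a sketch under two assumptions: dao is taken to be a DataOutputStream (the fragment does not declare it), and the output goes to an in-memory ByteArrayOutputStream instead of a file so the bytes can be printed immediately. The class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class WriteUtfSketch {
    // Writes the three characters with writeUTF and returns the bytes as hex text.
    static String writeUtfHex() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream dao = new DataOutputStream(buffer)) {
            dao.writeUTF("" + '\u20ac' + '\uFFFD' + '\u0250');
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buffer.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        // Prints: 00 08 e2 82 ac ef bf bd c9 90
        System.out.println(writeUtfHex());
    }
}
```

The first two bytes are the length prefix (8), followed by the 3 + 3 + 2 encoded bytes.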