Hi Vassili,

on Sun 07-11-1999 02:09 Vassilis Papathanassiou wrote:
>on Sat 06-11-1999 00:38 Ronald Andersson wrote:
>
>>Yes, Okami can do some very destructive things with mail if misconfigured.
>>Internally it treats mail as a special case of newsgroups, and for those
>>it is quite normal to erase stuff regularly, even if unread...  I have
>>spent many hours fiddling with my O.INF and the other files to get it to
>>work as I like it.  (Well, almost anyway...  ;-)
>>
>Seems that Okami moves the mails addressed to the main user to an archive
>folder ie mail for papval goes to papval.archive after the 'expire'
>period. These can't be deleted by accident I think. But it can happen for
>other mailboxes though, eg for stik-beta mailbox.

That should depend on the configs in O.INF too.  Properly configured you
should get the same behaviour for all mailboxes.  At least I think so,
although I only use a single mailbox (but several mailfolds).


>>I have MAGX_NET.LZH and MGXNET_L.LZH now, but have not yet had time for them.
>>
>Ok, so they have arrived (this is an ACK that verifies that the
>sending part has been informed of correct delivery :-)

ACKing ACKs are we...?  Here's another right back at ya  ;-)
(Good thing we're above TCP level here :-)


>----- snip ----- re: BNeT under construction (again)
>
>>Ok, but remember, tricky servers and device drivers were the main reason
>>why I introduced the DEFER mode, now implemented both for TCP and UDP.
>>
>This could be the solution for Single_TOS correct initialisation of the
>connection but I'm not sure yet.

I think it is needed there, at least if you are to make any calls to the
TCP API from inside GEMDOS calls.  Or similar calls from interrupt driven
code installed by TIMER_call.  A few functions are safe to use regardless,
such as CNbyte_count or TCP_info, since they never block.  But all
routines that might need to delay, and would then call _appl_yield, must
have this prevented by using DEFER mode for the connection.


>>If any of your problems relate to _appl_yield, or other 'unblocking'
>>issues, then that mode may be a solution.  My new NetD is dependent on it,
>>and could not work without it, as that would cause illegal system calls in
>>interrupts and thus also inside TOS functions etc.
>>
>No, my problem now is to pass the correct pointer to Single_TOS routines
>since it is now variable for received blocks. This whole thing started
>after I found out from TCP sources that the function 'receive' in CN_xx
>functions, allocates a new block for CNget_block and copies the buffer
>there but just returns the pointer for CNget_NDB. For this reason we
>have to free this NDB after processing the data.

Ok, I see the need, but not the problem.  That is how CNget_NDB has always
been defined, with the caller being responsible for calling KRfree later.
It has been this way since before STinG existed, as it is a STiK spec.
It is the same way with some other API functions too, like 'resolve'.

The only difference here is that an NDB needs two KRfree calls, one for the
data block and one for the NDB itself.  Was that the problem...?


>Still, there is also
>something I don't understand there, ie my_CNget_NDB routine cuts the
>link to ndb->next without checking if more blocks are in the queue.
>This makes even the old STiK example for CNget_NDB useless, ie a user
>program can only process one block at a time.

That is not the case, as illustrated below:
----- excerpt from my_CNget_NDB in TCP.C -----
	flag = -1;
	IF_lock(conn,test_f,1000L)
		return((NDB *) E_LOCKED);
	receive (conn, (uint8 *) & ndb, & flag, FALSE);
	END_lock(conn);

	if (flag < 0)
		return (NULL);
	else
	{	ndb->next = NULL;
		return (ndb);
	}
----- end of excerpt -----

You have misunderstood the meaning of the last part.  Setting ndb->next=NULL
is only done as an extra safety measure, so that if a program attempts to use
the content of the next field (which is wrong), it will only find that NULL,
rather than the next NDB, which is still part of the reception buffer and
must therefore be left alone.

The real unlinking of the current NDB from the reception buffer is not made
at this call level, but is made in the 'receive' function in TOOL.C, which
is called as shown earlier in the excerpt above.  Calling that function
with a length value of -1, like 'flag' here, causes that function to work
in direct NDB mode, rather than treating the NDBs as combined buffers for
a common data stream.

The code most vital to understanding this is shown below:
----- excerpt from 'receive' in TOOL.C -----
int16	receive (CONNEC *connec, uint8 *buffer, int16 *length, int16 getchar)

{	NDB  *ndb;

	if (*length >= 0)
	{	if (*length > connec->recve.count)
			*length = connec->recve.count;
		pull_up (& connec->recve.queue, buffer, *length);
	}
	else
	{	if ((ndb = connec->recve.queue) == NULL)
			*length = -1;
		else
		{	*length = ndb->len;
			* (NDB **) buffer = ndb;
			connec->recve.queue = ndb->next;
		}
	}
----- end of excerpt -----

It is 'connec->recve.queue = ndb->next;' which removes the current NDB
from the reception queue, replacing it with the next one, so it is then
quite safe for my_CNget_NDB to NULL its 'next' field later, as that field
has already served its purpose of putting the next NDB at the head of
the queue.


>STinG does also something strange when sending data, ie during TCP_send
>it allocates a block of memory based on the amount of data we want to
>send and copies them there, which is IMO a waste of CPU time. Note that
>opening a TCP connection we define a buffer for it, but seems that this
>buffer is only used in calculations of the TCP window and not to handle
>actual user data.

The usage of the term 'buffer' for that unit is a remnant from STiK defs.
It has never been a physical buffer in STinG programming.


>Note that this way we have two memcpys in vain (?)
>one when the user program copies the data where it SHOULD be the output
>buffer and one when my_TCP_send is called.

Those buffers are not equivalent, and can't replace each other. The user's
output buffer may be located anywhere, and it does not have to maintain
storage of the data during network transmission.  The internal NDB buffers
must be KRmalloc blocks, as must the data block buffers they point to.
Otherwise the whole system of buffer allocation and release falls down.

We can not entrust something that basic to the whims of each client.
Then a single faulty client would cause total network havoc.

This might seem wasteful to you, but that is partly because you think of
the user buffer as a dedicated one, which it doesn't have to be.

For a text editor (for example) that wants to transmit its current text,
there is no need to memcpy anything in the client.  Just point to the text
start and use that for the TCP_send, then repeat as needed by incrementing
the pointer with the 'theoretical' buffer size for each new TCP_send.
This way there is no waste.


>not to mention that to construct the IP packet several other memcpys are
>needed (probably some of them unavoidable).

That is true, but those are indeed needed.


>Since there are a lot of memcpys in STinG (some of them 0 bytes(!) ie
>when there are no IP 'options') and many more just 20 bytes or so,

That may be inefficient, but in some cases it costs more CPU time to
perform the extra tests needed to avoid work that a special case does
not require.  I do think the code you mention could benefit from
optimization in this respect, though.


>I'd suggest (at least for a start, until we maybe find something more
>elegant) a new memcpy routine, preferably in assembly and going backwards
>(which is AFAIK the fastest way for MC68xxx processors). This will
>replace the one in Pure_C's library. The lib routine is fine as a
>general memcpy method but it tests the amount of bytes to be copied,
>(the limit for the lib memcpy is 256 bytes IIRC) after this value it
>uses multiple register moves (saving and restoring most registers from
>the stack), calculates the remaining bytes etc which is IMO just 
>overhead for our needs. For us, even full MTU blocks can be copied with
>a simple move -(a1), -(a0) or move (a1)+, (a0)+ in a loop, without
>affecting 68000 performance, and making cache processors faster.

You are partly correct in this, but also partly incorrect (IMHO).
I am with you on the idea in general, of making moves in new ways, but not
on your evaluation of the choices available.  Also, we must consider well
before choosing changes that only improve cached speed while causing great
slowdowns on uncached systems.

The following discusses some aspects of uncached behaviour.

----- example 1: the simple loop you want -----
.loop:
	move	(a0)+,(a1)+
	dbra	d0,.loop
----- example 1 end -----
For each copied word, which obviously requires 1 read and 1 write access,
this method also requires an additional 2 read accesses for opcodes.

This means that transfer efficiency is down to 50%, compared to the 67% that
would result from using an inline sequence of moves.  Even though that does
require some computation of entry point, that will always be compensated for
if the data block is large.  But then this may not be the best method...

For large data blocks it is *MUCH* faster to use multiple register moves
as this minimizes the access losses of opcode fetching and interpretation.
But it can only be done when both source and dest address are even, or for
a subrange of the data when both source and dest address are odd.  If the
source is odd but the dest is even, or vice versa, then it can't be used.
(As there is no 'movem.b' instruction variant.)

When it can be used, just using half the possible registers, thus moving
32 bytes == 16 words per 2 instructions of 2 words each, gives a transfer
efficiency of 32/36 (16 reads + 16 writes against 4 opcode fetches), which
is appx 89%.  Using *all* the registers (should only be done with
interrupts off, due to interrupt use of SP), will mean transferring
64 bytes == 32 words per 2 instructions of 2 words, thus raising the
efficiency to 64/68, which is appx 94%.  As you can see there is a 'law
of diminishing returns' involved, so 8 registers is a good compromise.
It gives a binary power size-per-move, and avoids having to use SP,
which can be dangerous.
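For reference, an 8-register variant could be sketched like this (note that movem to memory has no postincrement mode, so the destination must be advanced separately; that lea and the dbra add loop overhead, so a loop form falls somewhat short of the pure two-instruction figure, which only an inline run of movem pairs can approach):
----- example 2: an 8-register movem copy loop (sketch) -----
; a0 = even source, a1 = even dest, d0 = number of 32-byte chunks - 1
.loop:
	movem.l	(a0)+,d1-d7/a2		; read 32 bytes into 8 registers
	movem.l	d1-d7/a2,(a1)		; write them out again
	lea	32(a1),a1		; no (a1)+ mode exists for movem writes
	dbra	d0,.loop
----- example 2 end -----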


Next down in efficiency is the multiple moves mentioned, being simply inline
sequences of move instructions.  These have much greater losses than the
multiple register moves, but are valid for all address combinations if the
move instruction used is 'move.b' (but that only gives efficiency of 33%).


Then there is the compromise method used by memcpy, which uses multiple
moves combined with looping to achieve fairly good speed, though it
is inevitably lower than that of the methods based purely on multiple
moves (with or without multiple regs).

Finally we have the worst speed of all, in a simple loop with 'move.b'
and 'dbra'.  This will need a minimum of six read accesses and two
write accesses to transfer one 16bit word.  That is intolerable except
for some very small blocks, since the transfer efficiency is down to 25%.

But for 'odd-to-even' or 'even-to-odd' transfers 33% is the maximum  :-(
So such situations should be avoided whenever possible.


>Well, Pure_C's memcpy is not that bad as it might seem from my analysis
>above, I just think we have an easy way to optimize it (and considering
>the fact that we don't have exactly fast computers, every 'tick' of the
>clock counts). The main problem I think, is that there is a long distance
>between STinG internal memory and client program memory. Since STinG can
>not rely on user memory, maybe user programs should make a better usage
>of STinG's internal memory. This needs a lot of work of course, so we
>can keep it as a thought for the future.

That is actually a different subject, but getting back to the choice of
memcpy replacements, I think the main problem is the testing for special
cases (odd address, data size etc), which makes it difficult to optimize
any such moves that deal with client-supplied buffers and/or buffers of
client-specified sizes.

A much better case for optimization exists in manipulating blocks defined
and created by STinG itself, and that is where I think we can make the best
improvements.

-- 
-------------------------------------------------------------------------
Regards:  Ronald Andersson                  mailto:dlanor@ettnet.se
http://dlanor.atari.org/    ICQ:38857203    http://www.ettnet.se/~dlanor/
-------------------------------------------------------------------------
