1. 04 Nov, 2018 2 commits
  2. 30 May, 2018 1 commit
  3. 05 Jun, 2017 1 commit
    • rxrpc: Add service upgrade support for client connections · 4e255721
      David Howells authored
      Make it possible for a client to use AuriStor's service upgrade facility.
      The client does this by adding an RXRPC_UPGRADE_SERVICE control message to
      the first sendmsg() of a call.  This takes no parameters.
      When recvmsg() starts returning data from the call, the service ID field in
      the returned msg_name will reflect the result of the upgrade attempt.  If
      the upgrade was ignored, srx_service will match what was set in the
      sendmsg(); if the upgrade happened the srx_service will be altered to
      indicate the service the server upgraded to.
      Note that:
       (1) The choice of upgrade service is up to the server
       (2) Further client calls to the same server that would share a connection
           are blocked if an upgrade probe is in progress.
       (3) This should only be used to probe the service.  Clients should then
           use the returned service ID in all subsequent communications with that
           server (and not set the upgrade).  Note that the kernel will not
           retain this information should the connection expire from its cache.
       (4) If a server that supports upgrading is replaced by one that doesn't,
           whilst a connection is live, and if the replacement is running, say,
           OpenAFS 1.6.4 or older or an older IBM AFS, then the replacement
           server will not respond to packets sent to the upgraded connection.
           At this point, calls will time out and the server must be reprobed.
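      A rough userspace sketch of the probe described above, using only the
      standard CMSG_*() macros.  Both helper names are hypothetical.  The
      caller is assumed to pass the AF_RXRPC socket level (SOL_RXRPC) and the
      RXRPC_UPGRADE_SERVICE message type as level and type; a real client
      would normally attach an RXRPC_USER_CALL_ID message as well, which is
      left out here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: attach a single parameterless control message to msg.
 * buf must be suitably aligned for struct cmsghdr and at least CMSG_SPACE(0)
 * bytes long.  Returns 0 on success, -1 if the buffer is too small.
 */
static int set_flag_cmsg(struct msghdr *msg, void *buf, size_t buflen,
			 int level, int type)
{
	struct cmsghdr *cmsg;

	if (buflen < CMSG_SPACE(0))
		return -1;

	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(0);

	cmsg = CMSG_FIRSTHDR(msg);
	assert(cmsg != NULL);
	cmsg->cmsg_level = level;
	cmsg->cmsg_type = type;
	cmsg->cmsg_len = CMSG_LEN(0);	/* header only: the message takes no parameters */
	return 0;
}

/*
 * Hypothetical helper: pick the service ID for all later calls to this
 * server.  'requested' is what was set in the sendmsg() address, 'returned'
 * is srx_service from the msg_name filled in by recvmsg().  *upgraded
 * reports whether the server changed the service.
 */
static unsigned short service_after_probe(unsigned short requested,
					  unsigned short returned,
					  int *upgraded)
{
	*upgraded = (returned != requested);
	return returned;
}
```

      The value returned by service_after_probe() is what the client would
      cache itself, since the kernel does not retain it once the connection
      expires.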
      Signed-off-by: David Howells <dhowells@redhat.com>
  4. 06 Apr, 2017 4 commits
  5. 10 Mar, 2017 1 commit
    • rxrpc: Wake up the transmitter if Rx window size increases on the peer · 702f2ac8
      David Howells authored
      The RxRPC ACK packet may contain an extension that includes the peer's
      current Rx window size for this call.  We adjust the local Tx window size
      to match.  However, the transmitter can stall if the receive window is
      reduced to 0 by the peer and then reopened.
      This is because the normal way that the transmitter is re-energised is by
      dropping something out of our Tx queue and thus making space.  When a
      single gap is made, the transmitter is woken up.  However, because there's
      nothing in the Tx queue at this point, this doesn't happen.
      To fix this, perform a wake_up() any time we see the peer's Rx window size
      increase.
      The observable symptom is that calls start failing on ETIMEDOUT and the
      	kAFS: SERVER DEAD state=-62
      appears in dmesg.
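      The idea behind the fix can be sketched in a few lines.  This is a toy
      model, not the kernel code: struct tx_call and apply_peer_rwind() are
      hypothetical, and a counter stands in for the wake_up() on the
      transmitter's wait queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-call transmit state. */
struct tx_call {
	unsigned int tx_winsize;	/* local Tx window, in packets */
	unsigned int wakeups;		/* stands in for wake_up() calls */
};

/*
 * Apply the Rx window size advertised by the peer in an ACK.  The
 * transmitter is woken whenever the window grows, even if nothing was
 * removed from the Tx queue.  Returns true if a wake-up was issued.
 */
static bool apply_peer_rwind(struct tx_call *call, unsigned int rwind)
{
	bool grew;

	assert(call != NULL);
	grew = rwind > call->tx_winsize;
	call->tx_winsize = rwind;
	if (grew)
		call->wakeups++;
	return grew;
}
```

      With this shape, a window that drops to 0 and then reopens produces a
      wake-up on the reopening ACK rather than relying on Tx queue rotation.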
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 07 Mar, 2017 1 commit
  7. 01 Mar, 2017 1 commit
    • rxrpc: Fix deadlock between call creation and sendmsg/recvmsg · 540b1c48
      David Howells authored
      All the routines by which rxrpc is accessed from the outside are serialised
      by means of the socket lock (sendmsg, recvmsg, bind,
      rxrpc_kernel_begin_call(), ...) and this presents a problem:
       (1) If a number of calls on the same socket are in the process of
           connecting to the same peer, a maximum of four concurrent live calls
           are permitted before further calls need to wait for a slot.
       (2) If a call is waiting for a slot, it is deep inside sendmsg() or
           rxrpc_kernel_begin_call() and the entry function is holding the socket
           lock.
       (3) sendmsg() and recvmsg() or the in-kernel equivalents are prevented
           from servicing the other calls as they need to take the socket lock to
           do so.
       (4) The socket is stuck until a call is aborted and makes its slot
           available to the waiter.
      Fix this by:
       (1) Provide each call with a mutex ('user_mutex') that arbitrates access
           by the users of rxrpc separately for each specific call.
       (2) Make rxrpc_sendmsg() and rxrpc_recvmsg() unlock the socket as soon as
           they've got a call and taken its mutex.
           Note that I'm returning EWOULDBLOCK from recvmsg() if MSG_DONTWAIT is
           set but someone else has the lock.  Should I instead only return
           EWOULDBLOCK if there's nothing currently to be done on a socket, and
           sleep in this particular instance because there is something to be
           done, but we appear to be blocked by the interrupt handler doing its
       (3) Make rxrpc_new_client_call() unlock the socket after allocating a new
           call, locking its user mutex and adding it to the socket's call tree.
           The call is returned locked so that sendmsg() can add data to it
           From the moment the call is in the socket tree, it is subject to
           access by sendmsg() and recvmsg() - even if it isn't connected yet.
       (4) Lock new service calls in the UDP data_ready handler (in
           rxrpc_new_incoming_call()) because they may already be in the socket's
           tree and the data_ready handler makes them live immediately if a user
           ID has already been preassigned.
           Note that the new call is locked before any notifications are sent
           that it is live, so doing mutex_trylock() *ought* to always succeed.
           Userspace is prevented from doing sendmsg() on calls that are in a
           too-early state in rxrpc_do_sendmsg().
       (5) Make rxrpc_new_incoming_call() return the call with the user mutex
           held so that a ping can be scheduled immediately under it.
           Note that it might be worth moving the ping call into
           rxrpc_new_incoming_call() and then we can drop the mutex there.
       (6) Make rxrpc_accept_call() take the lock on the call it is accepting and
           release the socket after adding the call to the socket's tree.  This
           is slightly tricky as we've dequeued the call by that point and have
           to requeue it.
           Note that requeuing emits a trace event.
       (7) Make rxrpc_kernel_send_data() and rxrpc_kernel_recv_data() take the
           new mutex immediately and don't bother with the socket mutex at all.
      This patch has the nice bonus that calls on the same socket are now to some
      extent parallelisable.
      Note that we might want to move rxrpc_service_prealloc() calls out from the
      socket lock and give it its own lock, so that we don't hang progress in
      other calls because we're waiting for the allocator.
      We probably also want to avoid calling rxrpc_notify_socket() from within
      the socket lock (rxrpc_accept_call()).
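      The lock hand-off at the heart of this change can be modelled in
      userspace.  The sketch below is an assumption-laden toy: atomic_flag
      stands in for the kernel mutexes, all toy_* names are hypothetical, and
      only the non-blocking (MSG_DONTWAIT-style) path is shown.  The socket
      lock is held just long enough to find the call and take its own mutex.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CALLS 4

struct toy_call {
	bool in_use;
	unsigned long user_call_id;
	atomic_flag user_mutex;		/* per-call lock */
};

struct toy_socket {
	atomic_flag sk_lock;		/* socket-wide lock */
	struct toy_call calls[TOY_MAX_CALLS];
};

static bool toy_trylock(atomic_flag *l) { return !atomic_flag_test_and_set(l); }
static void toy_unlock(atomic_flag *l) { atomic_flag_clear(l); }

static void toy_socket_init(struct toy_socket *sk)
{
	size_t i;

	atomic_flag_clear(&sk->sk_lock);
	for (i = 0; i < TOY_MAX_CALLS; i++) {
		sk->calls[i].in_use = false;
		sk->calls[i].user_call_id = 0;
		atomic_flag_clear(&sk->calls[i].user_mutex);
	}
}

/*
 * Look up a call and lock it without sleeping.  Returns 0 with *out locked,
 * -ENOENT if there is no such call, or -EWOULDBLOCK if a lock is busy.
 */
static int toy_get_call(struct toy_socket *sk, unsigned long id,
			struct toy_call **out)
{
	int ret = -ENOENT;
	size_t i;

	assert(out != NULL);
	if (!toy_trylock(&sk->sk_lock))
		return -EWOULDBLOCK;

	for (i = 0; i < TOY_MAX_CALLS; i++) {
		struct toy_call *call = &sk->calls[i];

		if (!call->in_use || call->user_call_id != id)
			continue;
		if (toy_trylock(&call->user_mutex)) {
			*out = call;
			ret = 0;
		} else {
			ret = -EWOULDBLOCK;
		}
		break;
	}

	toy_unlock(&sk->sk_lock);	/* dropped before any per-call work */
	return ret;
}
```

      Because the socket lock is released before the per-call work starts, a
      caller stuck on one call no longer prevents other calls on the same
      socket from being serviced.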
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marc Dionne <marc.c.dionne@auristor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 05 Jan, 2017 2 commits
    • rxrpc: Add some more tracing · b1d9f7fd
      David Howells authored
      Add the following extra tracing information:
       (1) Modify the rxrpc_transmit tracepoint to record the Tx window size as
           this is varied by the slow-start algorithm.
       (2) Modify the rxrpc_rx_ack tracepoint to record more information from
           received ACK packets.
       (3) Add an rxrpc_rx_data tracepoint to record the information in DATA
           packets.
       (4) Add an rxrpc_disconnect_call tracepoint to record call disconnection,
           including the reason the call was disconnected.
       (5) Add an rxrpc_improper_term tracepoint to record implicit termination
           of a call by a client starting a new call on a particular
           connection channel without first transmitting the final ACK for the
           previous call.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix handling of enums-to-string translation in tracing · b54a134a
      David Howells authored
      Fix the way enum values are translated into strings in AF_RXRPC
      tracepoints.  The problem with just doing a lookup in a normal flat array
      of strings or chars is that external tracing infrastructure can't find it.
      Rather, TRACE_DEFINE_ENUM must be used.
      Also sort the enums and string tables to make it easier to keep them in
      order so that a future patch to __print_symbolic() can be optimised to try
      a direct lookup into the table first before iterating over it.
      A couple of _proto() macro calls are removed because they referred to tables
      that got moved to the tracing infrastructure.  The relevant data can be
      found by way of tracing.
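      The general technique of generating an enum and its name table from one
      sorted list can be shown in plain C.  This is only an illustration of
      that pattern, not the kernel's TRACE_DEFINE_ENUM() machinery; every
      TOY_* name and string below is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list: one line per enum value, kept sorted by value. */
#define TOY_ACK_REASONS(X)		\
	X(TOY_ACK_REQUESTED, "REQ")	\
	X(TOY_ACK_DUPLICATE, "DUP")	\
	X(TOY_ACK_PING,      "PNG")

/* Expand the list once into the enum... */
#define TOY_AS_ENUM(sym, str) sym,
enum toy_ack_reason { TOY_ACK_REASONS(TOY_AS_ENUM) TOY_ACK_NR };

/* ...and once into a (value, name) table that a tool could walk. */
struct toy_symbol {
	int value;
	const char *name;
};

#define TOY_AS_SYMBOL(sym, str) { sym, str },
static const struct toy_symbol toy_ack_symbols[] = {
	TOY_ACK_REASONS(TOY_AS_SYMBOL)
};

/* Translate a value to its name, or "?" if it is not in the table. */
static const char *toy_ack_name(int value)
{
	size_t i, n = sizeof(toy_ack_symbols) / sizeof(toy_ack_symbols[0]);

	assert(n == TOY_ACK_NR);	/* enum and table come from the same list */

	/* The table is sorted and dense, so try a direct index first. */
	if (value >= 0 && (size_t)value < n &&
	    toy_ack_symbols[value].value == value)
		return toy_ack_symbols[value].name;

	/* Fall back to iterating over the table. */
	for (i = 0; i < n; i++)
		if (toy_ack_symbols[i].value == value)
			return toy_ack_symbols[i].name;
	return "?";
}
```

      Keeping the list sorted is what makes the direct-index shortcut in
      toy_ack_name() possible, which mirrors the optimisation the commit
      message anticipates for __print_symbolic().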
      Signed-off-by: David Howells <dhowells@redhat.com>
  9. 07 Nov, 2016 1 commit
    • udp: do fwd memory scheduling on dequeue · 7c13f97f
      Paolo Abeni authored
      A new argument is added to __skb_recv_datagram to provide
      an explicit skb destructor, invoked under the receive queue
      lock.
      The UDP protocol uses this argument to perform memory
      reclaiming on dequeue, so that the UDP protocol no longer
      sets skb->destructor.
      Instead explicit memory reclaiming is performed at close() time and
      when skbs are removed from the receive queue.
      In-kernel UDP protocol users now need to call the
      skb_recv_udp() variant instead of skb_recv_datagram() to
      properly perform memory accounting on dequeue.
      Overall, this allows the receive queue lock to be acquired
      only once on dequeue.
      Tested using pktgen with random src port, 64-byte packets,
      wire-speed on a 10G link as sender and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      nr sinks	vanilla		patched
      1		440		560
      3		2150		2300
      6		3650		3800
      9		4450		4600
      12		6250		6450
      v1 -> v2:
       - do rmem and allocated memory scheduling under the receive lock
       - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
       - avoid the typedef for the dequeue callback
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 06 Oct, 2016 3 commits
    • rxrpc: Partially handle OpenAFS's improper termination of calls · b3156274
      David Howells authored
      OpenAFS doesn't always correctly terminate client calls that it makes -
      this includes calls the OpenAFS servers make to the cache manager service.
      It should end the client call with either:
       (1) An ACK that has firstPacket set to one greater than the seq number of
           the reply DATA packet with the LAST_PACKET flag set (thereby
           hard-ACK'ing all packets).  nAcks should be 0 and acks[] should be
           empty (ie. no soft-ACKs).
       (2) An ACKALL packet.
      OpenAFS, though, may send an ACK packet with firstPacket set to the last
      seq number or less and soft-ACKs listed for all packets up to and including
      the last DATA packet.
      The transmitter, however, is obliged to keep the call live and the
      soft-ACK'd DATA packets around until they're hard-ACK'd as the receiver is
      permitted to drop any merely soft-ACK'd packet and request retransmission
      by sending an ACK packet with a NACK in it.
      Further, OpenAFS will also terminate a client call by beginning the next
      client call on the same connection channel.  This implicitly completes the
      previous call.
      This patch handles implicit ACK of a call on a channel by the reception of
      the first packet of the next call on that channel.
      If another call doesn't come along to implicitly ACK a call, then we have
      to time the call out.  There are some bugs there that will be addressed in
      subsequent patches.
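      The two conditions discussed above can be written as small predicates.
      This is a sketch with hypothetical names; in particular, treating "a
      later call number on the same channel" as a wrap-safe greater-than
      comparison is an assumption of the example, not something the commit
      message specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Proper termination, case (1): the ACK hard-ACKs everything.  firstPacket
 * must be one greater than the seq number of the reply DATA packet that
 * carried LAST_PACKET, and there must be no soft-ACKs.
 */
static bool toy_ack_ends_call(uint32_t first_packet, uint8_t n_acks,
			      uint32_t last_seq)
{
	return first_packet == last_seq + 1 && n_acks == 0;
}

/*
 * Implicit termination: the first packet of a later call arrives on the
 * channel that the current call occupies.
 */
static bool toy_implicitly_ended(uint32_t current_call, uint32_t incoming_call)
{
	return (int32_t)(incoming_call - current_call) > 0;
}
```

      An ACK of the kind OpenAFS may send, with firstPacket at or below the
      last seq number and soft-ACKs listed, fails toy_ack_ends_call(), so the
      call has to stay live until it is implicitly ended or times out.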
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix loss of PING RESPONSE ACK production due to PING ACKs · a5af7e1f
      David Howells authored
      Separate the output of PING ACKs from the output of other sorts of ACK so
      that if we receive a PING ACK and schedule transmission of a PING RESPONSE
      ACK, the response doesn't get cancelled by a PING ACK we happen to be
      scheduling transmission of at the same time.
      If a PING RESPONSE gets lost, the other side might just sit there waiting
      for it and refuse to proceed otherwise.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Only ping for lost reply in client call · a9f312d9
      David Howells authored
      When a reply is deemed lost, we send a ping to find out whether the other
      end received all the request data packets we sent.  This should be limited
      to client calls and we shouldn't do this on service calls.
      Signed-off-by: David Howells <dhowells@redhat.com>
  11. 30 Sep, 2016 4 commits
  12. 29 Sep, 2016 1 commit
  13. 24 Sep, 2016 5 commits
    • rxrpc: Implement slow-start · 57494343
      David Howells authored
      Implement RxRPC slow-start, which is similar to RFC 5681 for TCP.  A
      tracepoint is added to log the state of the congestion management algorithm
      and the decisions it makes.
       (1) Since we send fixed-size DATA packets (apart from the final packet in
           each phase), counters and calculations are in terms of packets rather
           than bytes.
       (2) The ACK packet carries the equivalent of TCP SACK.
       (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
           suited to SACK of a small number of packets.  It seems that, almost
           inevitably, by the time three 'duplicate' ACKs have been seen, we have
           narrowed the loss down to one or two missing packets, and the
           FLIGHT_SIZE calculation ends up as 2.
       (4) In rxrpc_resend(), if there was no data that apparently needed
           retransmission, we transmit a PING ACK to ask the peer to tell us what
           its Rx window state is.
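      For orientation, here is a generic packet-counted congestion window in
      the spirit of RFC 5681.  It is a simplified sketch, not the algorithm
      in this patch: the toy_* names are hypothetical, fast recovery is
      reduced to setting cwnd to ssthresh, and the floor of 2 packets on
      ssthresh is the RFC's minimum expressed in packets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cong {
	unsigned int cwnd;		/* congestion window, in packets */
	unsigned int ssthresh;		/* slow-start threshold, in packets */
	unsigned int cumul_acks;	/* ACKs counted in congestion avoidance */
	unsigned int dup_acks;		/* consecutive duplicate ACKs */
};

static unsigned int toy_max(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* An ACK that acknowledges new data. */
static void toy_cong_new_ack(struct toy_cong *c)
{
	assert(c != NULL);
	c->dup_acks = 0;
	if (c->cwnd < c->ssthresh) {
		c->cwnd++;			/* slow start: +1 per ACK */
	} else if (++c->cumul_acks >= c->cwnd) {
		c->cumul_acks = 0;		/* avoidance: +1 per window */
		c->cwnd++;
	}
}

/* A duplicate ACK; returns true when a fast retransmit should be issued. */
static bool toy_cong_dup_ack(struct toy_cong *c, unsigned int flight)
{
	assert(c != NULL);
	if (++c->dup_acks != 3)
		return false;
	c->ssthresh = toy_max(flight / 2, 2);
	c->cwnd = c->ssthresh;
	c->cumul_acks = 0;
	return true;
}

/* Retransmission timeout: collapse the window and restart slow start. */
static void toy_cong_timeout(struct toy_cong *c, unsigned int flight)
{
	assert(c != NULL);
	c->ssthresh = toy_max(flight / 2, 2);
	c->cwnd = 1;
	c->cumul_acks = 0;
	c->dup_acks = 0;
}
```

      Counting in packets rather than bytes, as note (1) describes, is what
      lets the window arithmetic stay this small.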
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Schedule an ACK if the reply to a client call appears overdue · 0d967960
      David Howells authored
      If we've sent all the request data in a client call but haven't seen any
      sign of the reply data yet, schedule an ACK to be sent to the server to
      find out if the reply data got lost.
      If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
      demand a response to find out whether we need to retransmit.
      If the server says it has received all of the data, we send an IDLE ACK to
      tell the server that we haven't received anything in the receive phase as
      To make this work, a non-immediate PING ACK must carry a delay.  I've chosen
      the same as the IDLE ACK for the moment.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Generate a summary of the ACK state for later use · 31a1b989
      David Howells authored
      Generate a summary of the Tx buffer packet state when an ACK is received
      for use in a later patch that does congestion management.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Reinitialise the call ACK and timer state for client reply phase · dd7c1ee5
      David Howells authored
      Clear the ACK reason, ACK timer and resend timer when entering the client
      reply phase when the first DATA packet is received.  New ACKs will be
      proposed once the data is queued.
      The resend timer is no longer relevant and we need to cancel ACKs scheduled
      to probe for a lost reply.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Send an immediate ACK if we fill in a hole · a7056c5b
      David Howells authored
      Send an immediate ACK if we fill in a hole in the buffer left by an
      out-of-sequence packet.  This may allow the congestion management in the peer
      to avoid a retransmission if packets got reordered on the wire.
      Signed-off-by: David Howells <dhowells@redhat.com>
  14. 23 Sep, 2016 5 commits
    • rxrpc: Add tracepoint for ACK proposal · 9c7ad434
      David Howells authored
      Add a tracepoint to log proposed ACKs, including whether the proposal is
      used to update a pending ACK or is discarded in favour of an earlier,
      higher priority ACK.
      Whilst we're at it, get rid of the rxrpc_acks() function and access the
      name array directly.  We do, however, need to validate the ACK reason
      number given to trace_rxrpc_rx_ack() to make sure we don't overrun the
      array.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Add a tracepoint to log injected Rx packet loss · 89b475ab
      David Howells authored
      Add a tracepoint to log received packets that get discarded due to Rx
      packet loss.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Pass the last Tx packet marker in the annotation buffer · 70790dbe
      David Howells authored
      When the last packet of data to be transmitted on a call is queued, tx_top
      is set and then the RXRPC_CALL_TX_LAST flag is set.  Unfortunately, this
      leaves a race in the ACK processing side of things because the flag affects
      the interpretation of tx_top and also allows us to start receiving reply
      data before we've finished transmitting.
      To fix this, make the following changes:
       (1) rxrpc_queue_packet() now sets a marker in the annotation buffer
           instead of setting the RXRPC_CALL_TX_LAST flag.
       (2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the
           same context as the routines that use it.
       (3) rxrpc_end_tx_phase() is simplified to just shift the call state.
           The Tx window must have been rotated before calling to discard the
           last packet.
       (4) rxrpc_receiving_reply() is added to handle the arrival of the first
           DATA packet of a reply to a client call (which is an implicit ACK of
           the Tx phase).
       (5) The last part of rxrpc_input_ack() is reordered to perform Tx
           rotation, then soft-ACK application and then to end the phase if we've
           rotated the last packet.  In the event of a terminal ACK, the soft-ACK
           application will be skipped as nAcks should be 0.
       (6) rxrpc_input_ackall() now has to rotate as well as ending the phase.
      In addition:
       (7) Alter the transmit tracepoint to log the rotation of the last packet.
       (8) Remove the no-longer relevant queue_reqack tracepoint note.  The
           ACK-REQUESTED packet header flag is now set as needed when we actually
           transmit the packet and may vary by retransmission.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Fix accidental cancellation of scheduled resend by ACK parser · be8aa338
      David Howells authored
      When rxrpc_input_soft_acks() is parsing the soft-ACKs from an ACK packet,
      it updates the Tx packet annotations in the annotation buffer.  If a
      soft-ACK is an ACK, then we overwrite unack'd, nak'd or to-be-retransmitted
      states and that is fine; but if the soft-ACK is a NACK, we overwrite the
      to-be-retransmitted with a nak - which isn't.
      Instead, we need to let any scheduled retransmission stand if the packet
      was NAK'd.
      Note that we don't reissue a resend if the annotation is in the
      to-be-retransmitted state because someone else must've scheduled the
      resend already.
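      The corrected state transition can be sketched as a pure function.  The
      enum values and the function name are hypothetical stand-ins for the
      kernel's annotation constants; the logic follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical Tx packet annotation states. */
enum toy_annot {
	TOY_ANNO_UNACK,		/* sent, nothing heard yet */
	TOY_ANNO_ACK,		/* soft-ACK'd */
	TOY_ANNO_NAK,		/* NAK'd, resend not yet scheduled */
	TOY_ANNO_RETRANS,	/* resend already scheduled */
};

/*
 * Apply one soft-ACK slot to a packet's annotation and return the new
 * annotation.  *resend_needed is set when this NAK should cause a resend
 * to be scheduled by the caller.
 */
static enum toy_annot toy_apply_soft_ack(enum toy_annot cur, bool acked,
					 bool *resend_needed)
{
	assert(resend_needed != NULL);
	*resend_needed = false;

	if (acked)
		return TOY_ANNO_ACK;	/* an ACK may overwrite any state */

	if (cur == TOY_ANNO_RETRANS)
		return cur;		/* let the scheduled resend stand */

	*resend_needed = true;
	return TOY_ANNO_NAK;
}
```

      The middle branch is the fix: before it, a NAK would have replaced the
      to-be-retransmitted state and so cancelled the pending resend.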
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Use before_eq() and friends to compare serial numbers · 98dafac5
      David Howells authored
      before_eq() and friends should be used to compare serial numbers (when not
      checking for (non)equality) rather than casting to int, subtracting and
      checking the result.
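      Helpers of this kind wrap the signed-difference trick in one place so
      that comparisons stay correct when a 32-bit serial number wraps.  The
      definitions below are an assumption about what such helpers look like,
      written with toy_ prefixes; they are not copied from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The helpers rely on out-of-range unsigned-to-signed conversion wrapping
 * around, which is implementation-defined in C but holds on the usual
 * two's-complement compilers.  Fail the build if it does not.
 */
static_assert((int32_t)0xffffffffu < 0, "need wrapping conversion to int32_t");

/* True if serial a was issued before serial b, allowing for wraparound. */
static bool toy_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static bool toy_before_eq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) <= 0;
}

static bool toy_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

static bool toy_after_eq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}
```

      For example, toy_before(0xfffffff0, 5) holds across the wrap, whereas a
      plain unsigned less-than comparison of the same two values would not.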
      Signed-off-by: David Howells <dhowells@redhat.com>
  15. 22 Sep, 2016 4 commits
    • rxrpc: Reduce the number of PING ACKs sent · fc943f67
      David Howells authored
      We don't want to send a PING ACK for every new incoming call as that just
      adds to the network traffic.  Instead, we send a PING ACK to the first
      three that we receive and then once per second thereafter.
      This could probably be made adjustable in future.
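      The policy described above amounts to a small rate limiter.  In this
      sketch the structure, the function name, the millisecond clock and the
      decision to keep the state in one place are all assumptions made for
      illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FREE_PINGS		3	/* first N new calls always get a ping */
#define TOY_PING_INTERVAL_MS	1000	/* thereafter at most one per second */

struct toy_ping_state {
	unsigned int pings_sent;
	uint64_t last_ping_ms;
};

/*
 * Decide whether a new incoming call should trigger a PING ACK, updating
 * the state if so.  now_ms is a monotonic timestamp in milliseconds.
 */
static bool toy_should_ping(struct toy_ping_state *st, uint64_t now_ms)
{
	assert(st != NULL);

	if (st->pings_sent >= TOY_FREE_PINGS &&
	    now_ms - st->last_ping_ms < TOY_PING_INTERVAL_MS)
		return false;

	st->pings_sent++;
	st->last_ping_ms = now_ms;
	return true;
}
```

      Turning the two constants into tunables would be one way to make the
      behaviour adjustable later.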
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Obtain RTT data by requesting ACKs on DATA packets · 50235c4b
      David Howells authored
      In addition to sending a PING ACK to gain RTT data, we can set the
      RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK.  The
      ACK packet contains the serial number of the packet it is in response to,
      so we can look through the Tx buffer for a matching DATA packet.
      This requires that the data packets be stamped with the time of
      transmission as a ktime rather than having the resend_at time in jiffies.
      This further requires the resend code to do the resend determination in
      ktimes and convert to jiffies to set the timer.
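      A simplified model of the two steps described above: match the serial
      number echoed in the ACK against the Tx buffer to obtain an RTT sample,
      and convert a nanosecond interval into timer ticks.  The slot layout,
      the function names and the use of plain int64_t nanoseconds in place of
      ktime_t are assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000LL

/* Hypothetical Tx buffer slot. */
struct toy_tx_slot {
	bool in_use;
	uint32_t serial;	/* serial number the packet was sent with */
	int64_t sent_ns;	/* transmission timestamp, in nanoseconds */
};

/*
 * Return the RTT sample in nanoseconds for the DATA packet whose serial
 * number the ACK is responding to, or -1 if no such packet is buffered.
 */
static int64_t toy_rtt_from_ack(const struct toy_tx_slot *slots, size_t n,
				uint32_t acked_serial, int64_t now_ns)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (slots[i].in_use && slots[i].serial == acked_serial)
			return now_ns - slots[i].sent_ns;
	return -1;
}

/* Convert a non-negative interval to timer ticks at hz, rounding up. */
static int64_t toy_ns_to_ticks(int64_t ns, int64_t hz)
{
	assert(ns >= 0 && hz > 0);
	return (ns / TOY_NSEC_PER_SEC) * hz +
	       ((ns % TOY_NSEC_PER_SEC) * hz + TOY_NSEC_PER_SEC - 1) /
	       TOY_NSEC_PER_SEC;
}
```

      Rounding up in toy_ns_to_ticks() means a timer set from the result
      never fires before the nanosecond deadline it was derived from.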
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Send pings to get RTT data · 8e83134d
      David Howells authored
      Send a PING ACK packet to the peer when we get a new incoming call from a
      peer we don't have a record for.  The PING RESPONSE ACK packet will tell us
      the following about the peer:
       (1) its receive window size
       (2) its MTU sizes
       (3) its support for jumbo DATA packets
       (4) if it supports slow start (similar to RFC 5681)
       (5) an estimate of the RTT
      This is necessary because the peer won't normally send us an ACK until it
      gets to the Rx phase and we send it a packet, but we would like to know
      some of this information before we start sending packets.
      A pair of tracepoints are added so that RTT determination can be observed.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • rxrpc: Add re-sent Tx annotation · f07373ea
      David Howells authored
      Add a Tx-phase annotation for packet buffers to indicate that a buffer has
      already been retransmitted.  This will be used by future congestion
      management.  Re-retransmissions of a packet don't affect the congestion
      window management in the same way as initial retransmissions.
      Signed-off-by: David Howells <dhowells@redhat.com>
  16. 17 Sep, 2016 4 commits