1. 26 Apr, 2018 1 commit
    • Jens Axboe's avatar
      blk-mq: fix discard merge with scheduler attached · 73027d80
      Jens Axboe authored
      [ Upstream commit 445251d0 ]
      I ran into an issue on my laptop that triggered a bug on the
      discard path:
      WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
       Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
       CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G     U           4.15.0+ #176
       Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
       RIP: 0010:nvme_setup_cmd+0x3d3/0x430
       RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
       RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
       RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
       RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
       R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
       R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
       FS:  0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        ? __sbitmap_queue_get+0x24/0x90
        ? blk_mq_get_tag+0xa3/0x250
        ? wait_woken+0x80/0x80
        ? blk_mq_get_driver_tag+0x97/0xf0
        ? deadline_remove_request+0x49/0xb0
        ? submit_bio+0x5c/0x110
        ? __blkdev_issue_discard+0x152/0x200
        ? account_page_dirtied+0xe2/0x1a0
        ? kjournald2+0xb0/0x250
        ? wait_woken+0x80/0x80
        ? commit_timeout+0x10/0x10
        ? kthread_create_worker_on_cpu+0x50/0x50
        ? do_group_exit+0x3a/0xa0
       Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
       ---[ end trace 50d361cc444506c8 ]---
       print_req_error: I/O error, dev nvme0n1, sector 847167488
      Decoding the assembly, the request claims to have 0xffff segments,
      while nvme counts two. This turns out to be because we don't check
      for a data carrying request on the mq scheduler path, and since
      blk_phys_contig_segment() returns true for a non-data request,
      we decrement the initial segment count of 0 and end up with
      0xffff in the unsigned short.
      There are a few issues here:
      1) We should initialize the segment count for a discard to 1.
      2) The discard merging is currently using the data limits for
         segments and sectors.
      Fix this up by having attempt_merge() correctly identify the
      request, and by initializing the segment count correctly
      for discards.
      This can only be triggered with mq-deadline on discard capable
      devices right now, which isn't a common configuration.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      How this work was done:
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
      All documentation files were explicitly excluded.
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
         For non */uapi/* files that summary was:
         SPDX license identifier                            # files
         GPL-2.0                                              11139
         and resulted in the first patch in this series.
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                        930
         and resulted in the second patch in this series.
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
         and that resulted in the third patch in this series.
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 23 Aug, 2017 1 commit
    • Christoph Hellwig's avatar
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig authored
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  4. 09 Aug, 2017 1 commit
  5. 27 Jun, 2017 1 commit
  6. 21 Jun, 2017 1 commit
  7. 18 Jun, 2017 4 commits
  8. 08 Apr, 2017 1 commit
  9. 08 Feb, 2017 3 commits
  10. 03 Feb, 2017 2 commits
    • Jens Axboe's avatar
      block: free merged request in the caller · e4d750c9
      Jens Axboe authored
      If we end up doing a request-to-request merge when we have completed
      a bio-to-request merge, we free the request from deep down in that
      path. For blk-mq-sched, the merge path has to hold the appropriate
      lock, but we don't need it for freeing the request. And in fact
      holding the lock is problematic, since we are now calling the
      mq sched put_rq_private() hook with the lock held. Other call paths
      do not hold this lock.
      Fix this inconsistency by ensuring that the caller frees a merged
      request. Then we can do it outside of the lock, making it both more
      efficient and fixing the blk-mq-sched problem of invoking parts of
      the scheduler with an unknown lock state.
      Reported-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
    • Jens Axboe's avatar
      blk-merge: return the merged request · b973cb7e
      Jens Axboe authored
      When we attempt to merge request-to-request, we return a 0/1 if we
      ended up merging or not. Change that to return the pointer to the
      request that we freed. We will use this to move the freeing of
      that request out of the merge logic, so that callers can drop
      locks before freeing the request.
      There should be no functional changes in this patch.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
  11. 17 Jan, 2017 2 commits
  12. 09 Dec, 2016 1 commit
    • Christoph Hellwig's avatar
      block: improve handling of the magic discard payload · f9d03f96
      Christoph Hellwig authored
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  Instead we allow the driver to add a
      "special" payload using a biovec embedded into struct request (unioned
      over other fields never used while in the driver), and overloading
      the number of segments for this case.
      This has a couple of advantages:
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
       - last but not least it removes a lot of code
      This patch is the common base for my WIP series for ranges discards and to
      remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
      so it would be good to get it in quickly.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  13. 01 Dec, 2016 2 commits
  14. 28 Oct, 2016 1 commit
  15. 24 Aug, 2016 1 commit
    • Ming Lei's avatar
      block: make sure a big bio is split into at most 256 bvecs · 4d70dca4
      Ming Lei authored
      After arbitrary bio size was introduced, the incoming bio may
      be very big. We have to split the bio into small bios so that
      each holds at most BIO_MAX_PAGES bvecs for safety reason, such
      as bio_clone().
      This patch fixes the following kernel crash:
      > [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      > [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
      > [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
      > [  172.660399] Oops: 0000 [#1] SMP
      > [...]
      > [  172.664780] Call Trace:
      > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
      > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
      > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
      > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
      > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
      > [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
      > [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
      The issue can be reproduced by the following steps:
      	- create one raid1 over two virtio-blk
      	- build bcache device over the above raid1 and another cache device
      	and bucket size is set as 2Mbytes
      	- set cache mode as writeback
      	- run random write over ext4 on the bcache device
      Fixes: 54efd50b(block: make generic_make_request handle arbitrarily sized bios)
      Reported-by: default avatarSebastian Roesner <sroesner-kernelorg@roesner-online.de>
      Reported-by: default avatarEric Wheeler <bcache@lists.ewheeler.net>
      Cc: stable@vger.kernel.org (4.3+)
      Cc: Shaohua Li <shli@fb.com>
      Acked-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  16. 16 Aug, 2016 1 commit
  17. 07 Aug, 2016 1 commit
    • Jens Axboe's avatar
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe authored
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      No intended functional changes in this commit.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  18. 21 Jul, 2016 2 commits
    • Damien Le Moal's avatar
      block: Fix front merge check · 17007f39
      Damien Le Moal authored
      For a front merge, the maximum number of sectors of the
      request must be checked against the front merge BIO sector,
      not the current sector of the request.
      Signed-off-by: default avatarDamien Le Moal <damien.lemoal@hgst.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tahsin Erdogan's avatar
      block: do not merge requests without consulting with io scheduler · 72ef799b
      Tahsin Erdogan authored
      Before merging a bio into an existing request, io scheduler is called to
      get its approval first. However, the requests that come from a plug
      flush may get merged by block layer without consulting with io
      In case of CFQ, this can cause fairness problems. For instance, if a
      request gets merged into a low weight cgroup's request, high weight cgroup
      now will depend on low weight cgroup to get scheduled. If high weigt cgroup
      needs that io request to complete before submitting more requests, then it
      will also lose its timeslice.
      Following script demonstrates the problem. Group g1 has a low weight, g2
      and g3 have equal high weights but g2's requests are adjacent to g1's
      requests so they are subject to merging. Due to these merges, g2 gets
      poor disk time allocation.
      cat > cfq-merge-repro.sh << "EOF"
      set -e
      mkdir -p $IO_ROOT
      if ! mount | grep -qw $IO_ROOT; then
        mount -t cgroup none -oblkio $IO_ROOT
      cd $IO_ROOT
      for i in g1 g2 g3; do
        if [ -d $i ]; then
          rmdir $i
      mkdir g1 && echo 10 > g1/blkio.weight
      mkdir g2 && echo 495 > g2/blkio.weight
      mkdir g3 && echo 495 > g3/blkio.weight
      (echo $BASHPID > g1/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=0k &> /dev/null)&
      (echo $BASHPID > g2/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=64k &> /dev/null)&
      (echo $BASHPID > g3/cgroup.procs &&
       fio --readonly --name name1 --filename /dev/sdb \
           --rw read --size 64k --bs 64k --time_based \
           --runtime=$RUNTIME --offset=256k &> /dev/null)&
      sleep $((RUNTIME+1))
      for i in g1 g2 g3; do
        echo ---- $i ----
        cat $i/blkio.time
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 162
      ---- g2 ----
      8:16 165
      ---- g3 ----
      8:16 686
      After applying the patch:
      # ./cfq-merge-repro.sh
      ---- g1 ----
      8:16 90
      ---- g2 ----
      8:16 445
      ---- g3 ----
      8:16 471
      Signed-off-by: default avatarTahsin Erdogan <tahsin@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  19. 09 Jun, 2016 1 commit
  20. 07 Jun, 2016 3 commits
  21. 03 Mar, 2016 1 commit
  22. 23 Jan, 2016 1 commit
  23. 12 Jan, 2016 1 commit
    • Keith Busch's avatar
      block: split bios to max possible length · e36f6204
      Keith Busch authored
      This splits bio in the middle of a vector to form the largest possible
      bio at the h/w's desired alignment, and guarantees the bio being split
      will have some data.
      The criteria for splitting is changed from the max sectors to the h/w's
      optimal sector alignment if it is provided. For h/w that advertise their
      block storage's underlying chunk size, it's a big performance win to not
      submit commands that cross them. If sector alignment is not provided,
      this patch uses the max sectors as before.
      This addresses the performance issue commit d3805611 attempted to
      fix, but was reverted due to splitting logic error.
      Signed-off-by: default avatarKeith Busch <keith.busch@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.4.x-
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  24. 08 Jan, 2016 1 commit
    • Jens Axboe's avatar
      Revert "block: Split bios on chunk boundaries" · 6126eb24
      Jens Axboe authored
      This reverts commit d3805611.
      If we end up splitting on the first segment, we don't adjust
      the sector count. That results in hitting a BUG() with attempting
      to split 0 sectors.
      As this is just a performance issue and not a regression since
      4.3 release, let's just rever this change. That gives us more
      time to test a real fix for 4.5, which would be marked for
      stable anyway.
  25. 23 Dec, 2015 1 commit
  26. 03 Dec, 2015 1 commit
  27. 30 Nov, 2015 1 commit
  28. 24 Nov, 2015 2 commits