[00:13:49] Operations, Traffic, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (kruusamagi) Any progress on that? Just a proposal: might it be possible to give project front pages some special status or something for that cache upd...
[00:43:12] (PS3) MarkAHershberger: Add the current time to the tag for the nightly build [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/495158
[00:44:19] (CR) jerkins-bot: [V: -1] Add the current time to the tag for the nightly build [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/495158 (owner: MarkAHershberger)
[01:04:10] (PS4) MarkAHershberger: Add the current time to the tag for the nightly build [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/495158
[01:04:40] (PS5) MarkAHershberger: Add the current time to the tag for the nightly build [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/495158
[01:57:26] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[01:58:30] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.780 second response time https://phabricator.wikimedia.org/T174916
[02:25:00] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[02:29:40] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.348 second response time https://phabricator.wikimedia.org/T174916
[02:34:36] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[03:21:20] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.030 second response time https://phabricator.wikimedia.org/T174916
[03:25:02] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[03:39:35] (PS1) GTirloni: openldap: Use newer slapd from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/495490 (https://phabricator.wikimedia.org/T217280)
[03:40:12] (CR) jerkins-bot: [V: -1] openldap: Use newer slapd from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/495490 (https://phabricator.wikimedia.org/T217280) (owner: GTirloni)
[03:51:20] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.808 second response time https://phabricator.wikimedia.org/T174916
[03:54:58] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[03:57:12] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time https://phabricator.wikimedia.org/T174916
[03:59:32] PROBLEM - Device not healthy -SMART- on phab1002 is CRITICAL: cluster=misc device=sdc instance=phab1002:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1002&var-datasource=eqiad+prometheus/ops
[04:00:58] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[04:42:50] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.853 second response time https://phabricator.wikimedia.org/T174916
[04:46:36] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[04:50:04] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.069 second response time https://phabricator.wikimedia.org/T174916
[04:53:46] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[05:07:00] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.743 second response time https://phabricator.wikimedia.org/T174916
[05:10:40] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[05:34:22] (PS2) GTirloni: openldap: Use newer slapd from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/495490 (https://phabricator.wikimedia.org/T217280)
[05:36:00] (CR) Andrew Bogott: [C: +1] ""how could it be worse?" he asked, right before it got worse" [puppet] - https://gerrit.wikimedia.org/r/495490 (https://phabricator.wikimedia.org/T217280) (owner: GTirloni)
[05:40:32] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.629 second response time https://phabricator.wikimedia.org/T174916
[05:46:40] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:03:04] Operations, serviceops: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (Petar.petkovic)
[06:11:46] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.042 second response time https://phabricator.wikimedia.org/T174916
[06:15:30] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:38:30] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.332 second response time https://phabricator.wikimedia.org/T174916
[06:40:38] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100%
[06:42:08] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[07:14:07] <_joe_> !log restarting pdfrender on scb1004
[07:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:28] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time https://phabricator.wikimedia.org/T174916
[07:19:40] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[09:01:29] (PS1) Ammarpad: Add new throttle rule for LMU Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/495494 (https://phabricator.wikimedia.org/T217929)
[09:55:55] (CR) Framawiki: [C: -1] Add new throttle rule for LMU Edit-a-thon (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/495494 (https://phabricator.wikimedia.org/T217929) (owner: Ammarpad)
[10:44:20] (PS2) Ammarpad: Add new throttle rule for LMU Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/495494 (https://phabricator.wikimedia.org/T217929)
[10:47:48] (CR) Ammarpad: Add new throttle rule for LMU Edit-a-thon (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/495494 (https://phabricator.wikimedia.org/T217929) (owner: Ammarpad)
[11:28:56] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:57:22] RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational
[12:11:23] (CR) Framawiki: [C: +1] Add new throttle rule for LMU Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/495494 (https://phabricator.wikimedia.org/T217929) (owner: Ammarpad)
[13:33:29] (CR) Alexandros Kosiaris: [C: +2] "LGTM, thanks!" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/495158 (owner: MarkAHershberger)
[13:34:02] (CR) jenkins-bot: Add the current time to the tag for the nightly build [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/495158 (owner: MarkAHershberger)
[13:34:16] (CR) Alexandros Kosiaris: [C: +1] openldap: Use newer slapd from stretch-backports [puppet] - https://gerrit.wikimedia.org/r/495490 (https://phabricator.wikimedia.org/T217280) (owner: GTirloni)
[15:41:54] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[15:44:16] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[16:58:37] (PS3) Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/495424
[17:21:26] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (Aklapper)
[17:55:52] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (Aklapper)
[18:08:29] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (Aklapper)
[18:31:08] PROBLEM - puppet last run on ganeti1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:57:37] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (elukey) Quick comment: I noticed while checking some memcache...
[18:57:51] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (elukey) Resolved→Open
[18:58:49] addshore: ---^ :)
[18:59:10] yay
[19:00:11] elukey: it should remain pretty stable
[19:01:20] added it to our team board again to take another look at it, as it really needs to go away
[19:01:22] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (Addshore)
[19:02:14] RECOVERY - puppet last run on ganeti1005 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[19:03:17] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (elukey) Correction: the last bursts of mcrouter's timeouts se...
[19:03:33] addshore: thanks a lot! It seems to me that the bursts of mcrouter timeouts (that triggered the memcached mw exceptions logged in this chan today)
[19:03:48] correlates with a burst of SET for the slab where the key is
[19:04:10] will try to add more data for tomorrow
[19:05:24] * elukey sends wikilove to addshore
[19:08:46] Operations, Wikimedia-SVG-rendering, Upstream: SVG fails to render properly due to several issues - https://phabricator.wikimedia.org/T46016 (Aklapper) >>! In T46016#2437515, @MoritzMuehlenhoff wrote: > the re-rendered version still shows the info box on broken rendering and the shadow is missing. U...
[19:32:58] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[19:35:20] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[22:27:25] PROBLEM - Disk space on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 473948 MB (5% inode=79%)
[22:28:30] (PS4) GTirloni: ldap: increase group TTL from 60 to 300 seconds in labs [puppet] - https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280)
[22:30:06] (CR) GTirloni: [C: +2] ldap: increase group TTL from 60 to 300 seconds in labs [puppet] - https://gerrit.wikimedia.org/r/494922 (https://phabricator.wikimedia.org/T217280) (owner: GTirloni)
[22:35:47] !log toolforge stretch: increased nscd group TTL from 60 to 300sec (T217280)
[22:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:50] T217280: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280
[22:46:04] ACKNOWLEDGEMENT - Disk space on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 466872 MB (5% inode=79%): Bstorm Working on disk space via T217993
[22:51:25] RECOVERY - Disk space on labstore1004 is OK: DISK OK