[01:23:06] Warning Alert for device pfw3-codfw.wikimedia.org - Inbound interface errors
[01:43:07] Warning Device pfw3-codfw.wikimedia.org recovered from Inbound interface errors
[03:33:06] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[04:04:37] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:06:30] Operations, Traffic, Wikimania-Hackathon-2018, Availability (MediaWiki-MultiDC), Patch-For-Review: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#4218652 (Joe) ChronologyProtector uses ` MySQLMasterPos`, which can work both with a GTID-based ma...
[06:27:06] PROBLEM - mailman list info on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:28:26] RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15501 bytes in 9.888 second response time
[06:29:56] PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:56:07] RECOVERY - puppet last run on puppetdb2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[07:36:14] reimaging snapshot1007 via the reimage script, please ignore all alerts
[08:06:41] (PS1) ArielGlenn: use php7.0 for dumps and related jobs on snapshot1007 now [puppet] - https://gerrit.wikimedia.org/r/434141 (https://phabricator.wikimedia.org/T181029)
[08:07:28] (CR) ArielGlenn: [C: 2] use php7.0 for dumps and related jobs on snapshot1007 now [puppet] - https://gerrit.wikimedia.org/r/434141 (https://phabricator.wikimedia.org/T181029) (owner: ArielGlenn)
[08:16:42] Operations, Traffic, Wikimania-Hackathon-2018, Availability (MediaWiki-MultiDC), Patch-For-Review: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1096885 (Krinkle) There are cases where a cookie doesn't work (specifically, for the log-in use ca...
[08:17:15] (CR) Tim Starling: "It shouldn't be conditional." [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[08:20:44] (CR) Tim Starling: [C: -1] "The whole file is for Wikimedia wikis, so you don't need to check if you are on WMF before you set things. If someone decided to switch of" [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[08:24:33] !log ariel@tin Started deploy [dumps/dumps@5438d41]: sync after reimage of snapshot1007
[08:24:37] !log ariel@tin Finished deploy [dumps/dumps@5438d41]: sync after reimage of snapshot1007 (duration: 00m 03s)
[08:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:28] (CR) Tim Starling: "Is this tested? I'm pretty sure that is not what QSA is for." [puppet] - https://gerrit.wikimedia.org/r/429447 (owner: Chad)
[08:35:39] (CR) Anomie: "> If someone decided to switch off manualRecache, I would prefer it if templates were not among the things that will catastrophically brea" [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[08:37:17] (PS3) Anomie: Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461)
[08:38:37] (CR) jerkins-bot: [V: -1] Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[08:39:57] (PS4) Anomie: Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461)
[08:41:20] (CR) jerkins-bot: [V: -1] Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[08:42:38] (CR) Anomie: "The unit test failure seems to be a Jenkins issue of some sort rather than something to do with this patch." [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[08:45:00] (PS4) Krinkle: multiversion: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - https://gerrit.wikimedia.org/r/432012
[08:46:23] (CR) jerkins-bot: [V: -1] multiversion: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - https://gerrit.wikimedia.org/r/432012 (owner: Krinkle)
[08:47:00] (CR) Krinkle: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/432012 (owner: Krinkle)
[08:48:24] (CR) jerkins-bot: [V: -1] multiversion: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - https://gerrit.wikimedia.org/r/432012 (owner: Krinkle)
[08:55:45] (CR) Tim Starling: [C: 2] Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[08:57:27] (CR) jerkins-bot: [V: -1] Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[09:13:02] Notice: /Stage[main]/Role::Secureredir::Client/Exec[handle-testing-cert]/returns: /usr/lib/python3/dist-packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for deployment-certcentral.deployment-prep.eqiad.wmflabs has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for
[09:13:03] details.)
[09:13:13] Interesting, our puppet certs do not have SAN?
[09:17:13] (PS3) ArielGlenn: keep dump prefetch files longer on dumps generation nfs servers [puppet] - https://gerrit.wikimedia.org/r/432378 (https://phabricator.wikimedia.org/T194124)
[09:19:35] (CR) ArielGlenn: [C: 2] keep dump prefetch files longer on dumps generation nfs servers [puppet] - https://gerrit.wikimedia.org/r/432378 (https://phabricator.wikimedia.org/T194124) (owner: ArielGlenn)
[09:20:33] (CR) Greg Grossmeier: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[09:20:39] (CR) Krinkle: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/432012 (owner: Krinkle)
[09:25:12] (PS1) ArielGlenn: keep partial dumps on the dump generation nfs servers, not the web servers [puppet] - https://gerrit.wikimedia.org/r/434144 (https://phabricator.wikimedia.org/T194124)
[09:32:40] (PS2) ArielGlenn: keep partial dumps on the dump generation nfs servers, not the web servers [puppet] - https://gerrit.wikimedia.org/r/434144 (https://phabricator.wikimedia.org/T194124)
[09:39:39] (CR) ArielGlenn: [C: 2] keep partial dumps on the dump generation nfs servers, not the web servers [puppet] - https://gerrit.wikimedia.org/r/434144 (https://phabricator.wikimedia.org/T194124) (owner: ArielGlenn)
[10:12:03] Operations, Research, The-Wikipedia-Library, Traffic, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#1741469 (kaldari) @TheDJ: The upstream bugs were resolved a year ago, but the error still happens in t...
[10:13:47] (PS2) ArielGlenn: update list of active dumps/datasets mirrors [puppet] - https://gerrit.wikimedia.org/r/424216
[10:15:15] (CR) ArielGlenn: [C: 2] update list of active dumps/datasets mirrors [puppet] - https://gerrit.wikimedia.org/r/424216 (owner: ArielGlenn)
[10:17:47] Operations, Analytics, Research, Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3773297 (kaldari) Should we open a separate bug for the Safari issue? (See also T87276#4218911)
[10:33:53] (PS1) ArielGlenn: move some dump generator hiera settings to common yaml [puppet] - https://gerrit.wikimedia.org/r/434156
[10:34:27] (PS1) WMDE-Fisch: Disable wikidiff2 inline moved paragraphs by default [mediawiki-config] - https://gerrit.wikimedia.org/r/434158 (https://phabricator.wikimedia.org/T194271)
[10:34:29] (CR) jerkins-bot: [V: -1] move some dump generator hiera settings to common yaml [puppet] - https://gerrit.wikimedia.org/r/434156 (owner: ArielGlenn)
[10:35:44] (CR) jerkins-bot: [V: -1] Disable wikidiff2 inline moved paragraphs by default [mediawiki-config] - https://gerrit.wikimedia.org/r/434158 (https://phabricator.wikimedia.org/T194271) (owner: WMDE-Fisch)
[10:36:50] (PS2) ArielGlenn: move some dump generator hiera settings to common yaml [puppet] - https://gerrit.wikimedia.org/r/434156
[10:42:04] (CR) ArielGlenn: [C: 2] move some dump generator hiera settings to common yaml [puppet] - https://gerrit.wikimedia.org/r/434156 (owner: ArielGlenn)
[10:49:45] (CR) jenkins-bot: Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: Anomie)
[10:50:45] !log tstarling@tin Synchronized wmf-config/CommonSettings.php: Scribunto maxLangCacheSize (duration: 01m 23s)
[10:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:23] (PS2) Urbanecm: Revert "Temp rate limit for arwiki due to mass vandalism" [mediawiki-config] - https://gerrit.wikimedia.org/r/433987 (https://phabricator.wikimedia.org/T192668)
[10:59:21] (PS2) Mark Bergsma: Small fixes [debs/pybal] - https://gerrit.wikimedia.org/r/433736
[10:59:22] (PS1) Mark Bergsma: Fix BGP collision detection [debs/pybal] - https://gerrit.wikimedia.org/r/434161
[10:59:24] (PS1) Mark Bergsma: Add tests that similate client or server sessions initial connection [debs/pybal] - https://gerrit.wikimedia.org/r/434162
[10:59:27] (PS1) Mark Bergsma: Move FSM connect state handling to the FSM itself [debs/pybal] - https://gerrit.wikimedia.org/r/434163
[11:02:11] (PS2) WMDE-Fisch: Disable wikidiff2 inline moved paragraphs by default [mediawiki-config] - https://gerrit.wikimedia.org/r/434158 (https://phabricator.wikimedia.org/T194271)
[11:02:13] (PS1) ArielGlenn: move last dumps generator hiera settings into the profile [puppet] - https://gerrit.wikimedia.org/r/434164
[11:12:47] Operations, Traffic, Wikimania-Hackathon-2018, Availability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#4219104 (Krinkle)
[11:32:20] Operations, Analytics, Research, Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#4219155 (gh87) >>! In T180921#4218922, @kaldari wrote: > Should we open a separate bug for the Safari issue? (See also T87276#4218911)...
[11:36:19] (CR) ArielGlenn: [C: 2] move last dumps generator hiera settings into the profile [puppet] - https://gerrit.wikimedia.org/r/434164 (owner: ArielGlenn)
[11:58:25] Operations, hardware-requests, Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#4219191 (Reedy)
[11:58:33] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4219190 (Reedy)
[11:58:54] Operations, hardware-requests, Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3562461 (Reedy)
[11:59:03] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4004075 (Reedy)
[12:08:05] (PS2) ArielGlenn: turn off misc dump crons on snapshot1007 [puppet] - https://gerrit.wikimedia.org/r/432365 (https://phabricator.wikimedia.org/T181936)
[12:11:31] (PS2) ArielGlenn: add snapshot1008 role and hiera settings [puppet] - https://gerrit.wikimedia.org/r/432367 (https://phabricator.wikimedia.org/T181936)
[12:21:35] !log reboot labtestneutron2002.codfw.wmnet
[12:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:42] Operations, Analytics, Research, Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3773297 (TheDJ) This works now in iOS 11.1 (13605.1.33.1.2) I think: Wikidata to Wikidata {F18377524} From wikidata.org to google. {F...
[12:38:46] Operations, Research, The-Wikipedia-Library, Traffic, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#4219291 (TheDJ) It's just complaining about the old misspelled value that we have in our chain current...
[14:09:21] huh, you can't ensure => absent a Uwsgi::App without settings => {}
[14:47:41] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[14:47:42] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[14:47:51] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[14:48:02] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[14:48:21] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[14:48:41] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[14:48:41] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[14:52:31] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[15:00:21] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[15:00:31] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[15:00:32] RECOVERY - DPKG on stat1005 is OK: All packages OK
[15:00:51] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[15:01:21] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[15:01:21] RECOVERY - Disk space on stat1005 is OK: DISK OK
[15:04:02] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:22:41] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Sun 2018-05-20 15:22:38 UTC.
[15:26:38] (CR) Brian Wolff: "I'd personally like to go with (for the immediate fix to return somewhat to the status quo)" [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[15:31:53] (PS4) Urbanecm: Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864)
[15:32:40] (PS5) Urbanecm: Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864)
[15:33:25] (CR) Brian Wolff: [C: 1] Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[15:33:41] (CR) Urbanecm: "@Bawolff Ok, let's change it to your values. I trust you on this number-deciding more than I trust myself :)." [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[15:34:00] (CR) jerkins-bot: [V: -1] Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[16:53:11] Operations, ops-codfw, fundraising-tech-ops: Interface errors on pfw3a-codfw:xe-0/0/17 - https://phabricator.wikimedia.org/T195216#4219677 (ayounsi) p:Triage>Normal
[17:23:01] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={container_status,create_container,image_status,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:23:11] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:24:02] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:24:13] akosiaris ^^
[17:24:22] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:24:22] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={container_status,list_podsandbox,podsandbox_status,remove_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:25:21] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:25:31] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:25:41] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:33:46] Operations, Traffic, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#4219769 (TheDJ) @nuria so this should work now, can we confirm that from the stats ?
[18:34:19] (PS5) Urbanecm: Initial configuration for pmswikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879)
[18:46:24] (PS6) Urbanecm: Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864)
[18:49:34] (CR) jerkins-bot: [V: -1] Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864) (owner: Urbanecm)
[19:55:19] (PS7) Urbanecm: Raise the rate limits for Commons to higher values than global 90 edits/minute [mediawiki-config] - https://gerrit.wikimedia.org/r/433988 (https://phabricator.wikimedia.org/T194864)
[22:07:11] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 61926 MB (12% inode=99%)
[22:14:51] RECOVERY - Disk space on elastic1019 is OK: DISK OK
[22:19:19] Operations, Traffic, Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4219996 (Krenair) I ran into a UWSGI / Python 3 segfault today, while my central service script is running under UWSGI and calling acme_tiny - see https://pha...
[22:23:52] Operations, Cloud-VPS: Cannot add or update records under DNS zones in Horizon - https://phabricator.wikimedia.org/T195059#4219999 (Krenair) For now I have resorted to creating krenair.hopto.org and pointing it at Labs
[22:26:04] Operations, Traffic, Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4220004 (Krenair) Some of the DNS challenge stuff we'll look at later might benefit from what I put together for T182927 - as our current acme_tiny does HTTP...