[03:20:33] (PS2) Krinkle: Revert "vcl: move XWD pass logic to wm_common" [puppet] - https://gerrit.wikimedia.org/r/552507 (https://phabricator.wikimedia.org/T233768) (owner: Ema)
[03:24:00] (CR) Krinkle: "While this site does not use MW, it does use PHP, and is served from a regular app server (mwmaint are basically depooled appserver with s" [puppet] - https://gerrit.wikimedia.org/r/552508 (https://phabricator.wikimedia.org/T233768) (owner: Ema)
[03:30:39] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Krinkle) Funneling the domain to a static page is easy and seems prudent indeed, in particular as the dom...
[04:38:25] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.27 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:45:17] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 80.12 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:04:23] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.57 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:11:13] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 77.24 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:21:31] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 36.67 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:30:07] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 72.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:34:01] PROBLEM - MegaRAID on ganeti2002 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:34:02] ACKNOWLEDGEMENT - MegaRAID on ganeti2002 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T239009 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:34:34] Operations, ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (ops-monitoring-bot)
[07:35:03] PROBLEM - snapshot of s3 in eqiad on db1115 is CRITICAL: snapshot for s3 at eqiad taken more than 4 days ago: Most recent backup 2019-11-20 07:12:17 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[07:55:06] (CR) Giuseppe Lavagetto: kubernetes::deployment_server: Add a private/general.yaml file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/549872 (https://phabricator.wikimedia.org/T237234) (owner: Giuseppe Lavagetto)
[08:00:20] (CR) Giuseppe Lavagetto: "> While this site does not use MW, it does use PHP, and is served" [puppet] - https://gerrit.wikimedia.org/r/552508 (https://phabricator.wikimedia.org/T233768) (owner: Ema)
[08:11:15] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.89 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:19:49] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 73.15 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:35:13] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.62 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:40:49] Operations, Product-Analytics, SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Fuzzy) I prefer to set no time limit. This is administrative task for the project, as long as I'm an administrator at the Hebrew Wikisource. If definite time li...
[09:43:45] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 77.98 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:00:03] Operations, Product-Analytics, SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Ijon) I endorse this.
[10:14:13] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:43] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:38:33] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 88.12 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:39:51] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:29] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.64 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:28:58] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (MarcoAurelio)
[11:29:55] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 89.3 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:00:54] (CR) Volans: [C: -1] "It seems to miss some pieces and has some issue, see inline for the details." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/551948 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[12:14:58] (CR) Volans: [C: -1] "Needs rebasing based on recent changes to the reports running mechanism" [puppet] - https://gerrit.wikimedia.org/r/550053 (owner: CRusnov)
[12:15:35] (PS2) Volans: homer: move git peer to hiera [puppet] - https://gerrit.wikimedia.org/r/551195
[12:22:07] (CR) Volans: [C: +2] "Effectively a noop in prod." [puppet] - https://gerrit.wikimedia.org/r/551195 (owner: Volans)
[12:23:01] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.53 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:31:37] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 81.63 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:24:53] !log rebooting snapshot1008 to clear up some nfs + kernel issues
[14:24:57] ...we hope
[14:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:39] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.01 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:45:05] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 82.02 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:55:58] 1/2 cleared up.
[15:01:39] !log rebooting dumpsdata1002 to clear up the other half of the nfs issues
[15:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:50] maybe
[15:37:20] Operations, Dumps-Generation, Patch-For-Review: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ArielGlenn) Adds-changes dumps did not run properly; when I checked this afternoon the Nov 23 job was hung indefinitely trying to get a lockfile on the first wiki to...
[15:39:27] (PS1) ArielGlenn: add ability to skip locking for adds-changes dumps [dumps] - https://gerrit.wikimedia.org/r/552658 (https://phabricator.wikimedia.org/T224563)
[15:40:46] (CR) ArielGlenn: [C: +2] add ability to skip locking for adds-changes dumps [dumps] - https://gerrit.wikimedia.org/r/552658 (https://phabricator.wikimedia.org/T224563) (owner: ArielGlenn)
[15:41:46] !log ariel@deploy1001 Started deploy [dumps/dumps@bfdea34]: can skip locks for misc dumps
[15:41:49] !log ariel@deploy1001 Finished deploy [dumps/dumps@bfdea34]: can skip locks for misc dumps (duration: 00m 03s)
[15:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:26] (PS1) ArielGlenn: configure adds-changes dumps to skip locking for now [puppet] - https://gerrit.wikimedia.org/r/552659
[15:46:38] (CR) jerkins-bot: [V: -1] configure adds-changes dumps to skip locking for now [puppet] - https://gerrit.wikimedia.org/r/552659 (owner: ArielGlenn)
[15:49:09] Operations, Traffic, fixcopyright.wikimedia.org, Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) @Krinkle Agreed. Legal has requested that we send traffic to the first URL you mention, ht...
[15:49:25] (PS2) ArielGlenn: configure adds-changes dumps to skip locking for now [puppet] - https://gerrit.wikimedia.org/r/552659 (https://phabricator.wikimedia.org/T224563)
[15:52:13] (CR) ArielGlenn: [C: +2] configure adds-changes dumps to skip locking for now [puppet] - https://gerrit.wikimedia.org/r/552659 (https://phabricator.wikimedia.org/T224563) (owner: ArielGlenn)
[16:48:19] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 49.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:55:07] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 73.16 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:41:55] PROBLEM - Disk space on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops
[18:42:03] PROBLEM - configured eth on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[18:42:05] PROBLEM - Check systemd state on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:42:29] PROBLEM - Check size of conntrack table on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:42:49] PROBLEM - dhclient process on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[18:42:55] PROBLEM - Check whether ferm is active by checking the default input chain on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:42:57] PROBLEM - DPKG on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[18:43:19] PROBLEM - SSH on analytics-tool1001 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:46:35] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:55:19] RECOVERY - SSH on analytics-tool1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:01:47] PROBLEM - Check the NTP synchronisation status of timesyncd on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[19:03:43] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:04:11] RECOVERY - Disk space on analytics-tool1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops
[19:04:19] RECOVERY - configured eth on analytics-tool1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[19:04:21] RECOVERY - Check systemd state on analytics-tool1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:04:45] RECOVERY - Check size of conntrack table on analytics-tool1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[19:05:03] RECOVERY - dhclient process on analytics-tool1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[19:05:09] RECOVERY - Check whether ferm is active by checking the default input chain on analytics-tool1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:05:13] RECOVERY - DPKG on analytics-tool1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[19:11:46] Operations, Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (Moushira) Thanks @Aklapper , Mailinglist name: GLOW@lists.wikimedia.org The purpose is to have all project team members of all countries, in addition to WMF staff, share news, insight...
[19:32:31] RECOVERY - Check the NTP synchronisation status of timesyncd on analytics-tool1001 is OK: OK: synced at Sun 2019-11-24 19:32:29 UTC. https://wikitech.wikimedia.org/wiki/NTP
[20:18:59] (PS4) Tpt: Properly configures the Wikisource extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T236502)
[20:22:17] (PS5) Tpt: Properly configures the Wikisource extension on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/551254 (https://phabricator.wikimedia.org/T239034)
[20:53:03] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on dumpsdata1002 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode
[22:10:27] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.59 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:15:37] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 94.23 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:22:06] (PS3) Andrew Bogott: wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708)
[23:22:08] (PS1) Andrew Bogott: designate: fix puppet_git_repo_name for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/552675 (https://phabricator.wikimedia.org/T238708)
[23:25:36] (CR) Andrew Bogott: [C: +2] designate: fix puppet_git_repo_name for codfw1dev [puppet] - https://gerrit.wikimedia.org/r/552675 (https://phabricator.wikimedia.org/T238708) (owner: Andrew Bogott)
[23:41:07] (PS4) Andrew Bogott: wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708)
[23:54:59] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.35 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:58:27] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 87.16 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1