[00:01:21] (CR) Dzahn: "ping" [puppet] - https://gerrit.wikimedia.org/r/295880 (https://phabricator.wikimedia.org/T138506) (owner: Hashar)
[00:02:01] (PS1) Ppchelko: Change-Prop: Enable file transclusion updates [puppet] - https://gerrit.wikimedia.org/r/306308
[00:03:29] (PS3) BBlack: varnish: remove libgeoip from text VCL compilation [puppet] - https://gerrit.wikimedia.org/r/305648 (https://phabricator.wikimedia.org/T99226)
[00:03:48] (CR) BBlack: [C: 2 V: 2] varnish: remove libgeoip from text VCL compilation [puppet] - https://gerrit.wikimedia.org/r/305648 (https://phabricator.wikimedia.org/T99226) (owner: BBlack)
[00:05:55] (PS8) BBlack: Remove geoiplookup service IPs from LVS [puppet] - https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902)
[00:05:57] (PS8) BBlack: text VCL: remove JSON output support [puppet] - https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902)
[00:05:59] (PS1) BBlack: text VCL: remove geoiplookup hostname support [puppet] - https://gerrit.wikimedia.org/r/306309 (https://phabricator.wikimedia.org/T100902)
[00:07:26] (PS1) Dzahn: recdns: remove chromium from LVS nameservers_override [puppet] - https://gerrit.wikimedia.org/r/306311
[00:07:47] (CR) Dzahn: [C: -1] "just preparing changes for tomorrow" [puppet] - https://gerrit.wikimedia.org/r/306311 (owner: Dzahn)
[00:08:44] (PS2) Ppchelko: Change-Prop: Enable file transclusion updates [puppet] - https://gerrit.wikimedia.org/r/306308
[00:11:34] (CR) Ppchelko: "https://puppet-compiler.wmflabs.org/3815/" [puppet] - https://gerrit.wikimedia.org/r/306308 (owner: Ppchelko)
[00:30:09] PROBLEM - MD RAID on wtp2016 is CRITICAL: CRITICAL: Active: 3, Working: 3, Failed: 1, Spare: 0
[00:39:45] Operations, Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2578015 (Dzahn)
[00:40:01] Operations, Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2578027 (Dzahn) a:Dzahn
[00:55:06] (CR) BryanDavis: [C: 1] "Seems fine to me. The puppet config for the existing labs instances that use this will need to be updated in coordination with merging. I " [puppet] - https://gerrit.wikimedia.org/r/298906 (owner: Dzahn)
[01:09:38] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:19:48] PROBLEM - MariaDB Slave Lag: m3 on db1043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1181.65 seconds
[01:33:27] Hmm, 2 extensions do not seem to be showing their foo-desc messages on Special:Version 🤔
[01:34:19] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[01:44:28] RECOVERY - MariaDB Slave Lag: m3 on db1043 is OK: OK slave_sql_lag Replication lag: 0.43 seconds
[01:57:19] !log temp stop dhcp service on install2001 - debug
[01:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:01:27] 602 Aug 24 01:56:41 install2001 puppet-agent[16637]: Disabling Puppet.
[02:01:30] 603 Aug 24 01:58:46 install2001 systemd[1]: Starting LSB: DHCP server...
[02:01:33] are you kidding me,
[02:10:08] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:10:48] Operations, ops-codfw, Discovery: codfw: rack/setup/deploy wdqs200[12]switch configuration - https://phabricator.wikimedia.org/T143613#2578077 (Papaul)
[02:11:58] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[02:25:48] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.15) (duration: 10m 16s)
[02:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:00:44] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.16) (duration: 18m 24s)
[03:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 24 03:07:38 UTC 2016 (duration 6m 55s)
[03:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:25:28] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: puppet fail
[03:53:28] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[03:54:48] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:58:47] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[04:33:28] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: puppet fail
[04:40:58] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:42:48] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[04:48:58] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:52:58] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[05:01:27] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[05:44:49] (PS8) Madhuvishy: nfs: Modify /data/scratch on nfs clients to point to mount from labstore1003 [puppet] - https://gerrit.wikimedia.org/r/306019 (https://phabricator.wikimedia.org/T134896)
[07:03:17] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 209, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]
[07:04:16] (CR) Hashar: "Sorry Daniel, I have seen your previous notification, that is in my backlog of things to check / polish up this week." [puppet] - https://gerrit.wikimedia.org/r/295880 (https://phabricator.wikimedia.org/T138506) (owner: Hashar)
[07:07:08] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]
[07:08:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]
[07:11:56] I'll be out for an hour or so, running errands
[07:16:47] Operations, Commons, Wikimedia-SVG-rendering, media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2578259 (MoritzMuehlenhoff) That's a good point. I'll doublecheck with Legal on their interpretation on the EULA on shipping only the f...
[07:17:56] Operations, ops-codfw, DBA: es2004 has a dead disk, but it is not under warranty - https://phabricator.wikimedia.org/T143220#2578260 (jcrespo) Ok, now things are nice: it says `Unconfigured(good), Spun Up` rather than Failed.
[07:21:08] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[07:22:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[07:30:15] Operations, ops-codfw, DBA: es2004 has a dead disk, but it is not under warranty - https://phabricator.wikimedia.org/T143220#2578268 (jcrespo) Now rebuilding, I didn't know the drive didn't rebuild automatically. ``` root@es2004:~$ megacli -Pdgetmissing -a0 A...
[07:35:18] Operations, Commons, Wikimedia-SVG-rendering, media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2578281 (MoritzMuehlenhoff) Actually, since we're planning to build a custom package anyway, we can simply choose to ship all fonts, bu...
[07:40:42] Operations, Ops-Access-Requests, Research-and-Data, Research-collaborations, Research-management: Request access to data for WDQS research - https://phabricator.wikimedia.org/T142780#2578286 (mkroetzsch) Regarding my remaining todos: * I have signed the L3 doc * Here is my dedicated productio...
[07:43:17] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 211, down: 0, dormant: 0, excluded: 0, unused: 0
[07:43:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[07:46:03] (PS2) Muehlenhoff: Disable unprivileged user namespaces on trusty systems [puppet] - https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567)
[07:47:09] PROBLEM - HHVM jobrunner on mw2154 is CRITICAL: No route to host
[07:48:30] FYI just got a 503 on wiki if anyone sees others
[07:48:33] Request from 2601:640:8100:62:95ba:628b:cfc:ba8c via cp2001 cp2001, Varnish XID 3536412361
[07:48:33] Error: 503, Service Unavailable at Wed, 24 Aug 2016 07:47:30 GMT
[07:48:45] (random user page)
[07:49:08] RECOVERY - HHVM jobrunner on mw2154 is OK: HTTP OK: HTTP/1.1 200 OK - 222 bytes in 0.076 second response time
[07:49:47] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: puppet fail
[07:50:18] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[07:50:49] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:51:47] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:51:47] PROBLEM - puppet last run on mw2067 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:51:57] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: puppet fail
[07:51:58] PROBLEM - puppet last run on mw2249 is CRITICAL: Timeout while attempting connection
[07:52:09] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail
[07:52:49] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:52:49] PROBLEM - puppet last run on mw2142 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:52:59] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:52:59] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:53:00] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:53:49] PROBLEM - Disk space on es2001 is CRITICAL: Connection refused or timed out
[07:53:49] PROBLEM - Apache HTTP on mw2240 is CRITICAL: No route to host
[07:53:50] PROBLEM - puppet last run on mw2242 is CRITICAL: CRITICAL: puppet fail
[07:54:08] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [2000.0]
[07:55:32] <_joe_> Jamesofur: looks like a codfw issue
[07:55:49] RECOVERY - Apache HTTP on mw2240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.110 second response time
[07:56:45] fun :-/ thanks _joe_
[07:57:49] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:58:17] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[08:00:18] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is OK: OK: Less than 20.00% above the threshold [1200.0]
[08:02:48] PROBLEM - puppet last run on ms-be2021 is CRITICAL: Connection refused or timed out
[08:03:08] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:03:17] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:04:37] (PS1) Ema: Disable codfw [dns] - https://gerrit.wikimedia.org/r/306342
[08:05:08] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:06:08] PROBLEM - puppet last run on ms-be2024 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:06:18] (CR) Jcrespo: [C: 1] Disable codfw [dns] - https://gerrit.wikimedia.org/r/306342 (owner: Ema)
[08:06:28] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:07:07] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:07:28] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:07:28] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: puppet fail
[08:07:29] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:08:09] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Puppet has 3 failures
[08:08:24] (CR) Ema: [C: 2] Disable codfw [dns] - https://gerrit.wikimedia.org/r/306342 (owner: Ema)
[08:09:09] PROBLEM - puppet last run on mw2155 is CRITICAL: CRITICAL: Puppet has 4 failures
[08:09:16] !log disabling codfw in dns
[08:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:09:27] PROBLEM - puppet last run on mw2074 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:09:29] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:10:06] we will wait for the failover, check no more errors are shown to users, then debug the network issue
[08:10:09] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail
[08:10:58] PROBLEM - puppet last run on elastic2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:11:29] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:12:17] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:12:27] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:12:28] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:15:38] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: puppet fail
[08:16:08] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:17:08] PROBLEM - Host mw2244 is DOWN: CRITICAL - Time to live exceeded (10.192.0.70)
[08:17:08] PROBLEM - Host ms-be2003 is DOWN: CRITICAL - Time to live exceeded (10.192.0.21)
[08:17:08] PROBLEM - Host mc2013 is DOWN: CRITICAL - Time to live exceeded (10.192.32.20)
[08:17:08] PROBLEM - Host cp2018 is DOWN: CRITICAL - Time to live exceeded (10.192.32.117)
[08:17:09] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:17:27] RECOVERY - Host mc2013 is UP: PING OK - Packet loss = 0%, RTA = 37.73 ms
[08:17:27] RECOVERY - Host ms-be2003 is UP: PING OK - Packet loss = 0%, RTA = 37.72 ms
[08:17:27] RECOVERY - Host cp2018 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms
[08:17:29] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: puppet fail
[08:17:38] RECOVERY - Host mw2244 is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms
[08:17:48] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: puppet fail
[08:17:48] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:18:07] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:18:17] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:18:17] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:18:28] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [2000.0]
[08:18:29] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:18:39] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0]
[08:19:17] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:19:18] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:19:19] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:19:37] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: puppet fail
[08:19:37] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:20:09] RECOVERY - puppet last run on mw2242 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:20:29] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is OK: OK: Less than 20.00% above the threshold [1200.0]
[08:21:49] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:21:49] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:23:50] PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:24:08] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:24:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0]
[08:29:07] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:29:13] !log disabling cr2-codfw:xe-5/0/1 (link to cr2-eqiad), flapping since 07:31 UTC
[08:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:29:27] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[08:29:27] RECOVERY - puppet last run on elastic2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:29:38] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:30:47] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:32:37] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[08:32:38] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[08:32:38] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:33:38] RECOVERY - puppet last run on mw2155 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:33:47] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:33:48] RECOVERY - puppet last run on mw2074 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:33:58] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:33:58] RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[08:33:58] RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:33:59] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[08:34:18] good morning
[08:35:38] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:36:47] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:36:55] (PS1) Ema: Revert "Disable codfw" [dns] - https://gerrit.wikimedia.org/r/306392
[08:36:58] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[08:36:59] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0]
[08:37:28] RECOVERY - puppet last run on elastic2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:38:08] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:38:33] (CR) Ema: [C: 2] Revert "Disable codfw" [dns] - https://gerrit.wikimedia.org/r/306392 (owner: Ema)
[08:39:19] !log dns: re-enabling codfw
[08:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:39:50] Operations, Beta-Cluster-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2578340 (hashar) That is really neat @AlexMonk-WMF ! Is there anything left to do?
[08:42:09] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:09] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is OK: OK: Less than 20.00% above the threshold [1200.0]
[08:44:18] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:44:18] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[08:44:20] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[08:44:27] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:45:36] Operations, Beta-Cluster-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2578342 (AlexMonk-WMF) Convince someone with ops rights to merge the patch
[08:46:19] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:46:57] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:47:21] (PS4) Muehlenhoff: contint::firewall: Limit to production networks [puppet] - https://gerrit.wikimedia.org/r/301627
[08:48:07] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:48:38] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:48:38] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:48:47] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:49:28] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:49:37] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:50:18] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:52:37] (PS1) Filippo Giunchedi: puppet_compiler: clean older output dirs [puppet] - https://gerrit.wikimedia.org/r/306399 (https://phabricator.wikimedia.org/T143671)
[08:53:20] Operations, Continuous-Integration-Infrastructure, puppet-compiler, Patch-For-Review: OSError: [Errno 28] No space left on device on compiler02.puppet3-diffs.eqiad.wmflabs - https://phabricator.wikimedia.org/T143671#2578346 (fgiunchedi) yup @greg ! I've proposed a patch to cleanup old compilation...
[08:55:19] (CR) Muehlenhoff: [C: 2] contint::firewall: Limit to production networks [puppet] - https://gerrit.wikimedia.org/r/301627 (owner: Muehlenhoff)
[08:55:57] Puppet, Continuous-Integration-Infrastructure: Cant refresh Nodepool snapshot due to puppet: Could not find class passwords::puppet::database - https://phabricator.wikimedia.org/T143769#2578347 (hashar)
[08:58:38] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:58:39] Operations, MediaWiki-General-or-Unknown, Services, Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2578360 (Jhernandez) It would be interesting to know if query parameters end up being serialized in alphabetical order, which would ma...
[08:59:38] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[09:03:19] !log gallium contint::firewall: Limited to production networks https://gerrit.wikimedia.org/r/301627 . For monitoring do: grep iptables-dropped /var/log/syslog
[09:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:06:09] (PS1) Filippo Giunchedi: site: add codfw syslog server, wezen [puppet] - https://gerrit.wikimedia.org/r/306400
[09:08:59] (CR) Gehel: WIP: discovery stats module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/305673 (https://phabricator.wikimedia.org/T143048) (owner: MaxSem)
[09:10:08] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:22:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[09:30:54] !log starting rolling restart of elasticsearch codfw for JVM and elasticsearch upgrade
[09:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:32:21] (CR) Filippo Giunchedi: [C: 2] site: add codfw syslog server, wezen [puppet] - https://gerrit.wikimedia.org/r/306400 (owner: Filippo Giunchedi)
[09:32:26] (PS2) Filippo Giunchedi: site: add codfw syslog server, wezen [puppet] - https://gerrit.wikimedia.org/r/306400
[09:38:58] (PS1) Jcrespo: Configuration changes regarding firewall and mysql for prometheus [puppet] - https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757)
[09:40:11] (CR) jenkins-bot: [V: -1] Configuration changes regarding firewall and mysql for prometheus [puppet] - https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo)
[09:42:18] (PS1) Filippo Giunchedi: codfw: add syslog CNAME for codfw [dns] - https://gerrit.wikimedia.org/r/306405
[09:43:34] (CR) Filippo Giunchedi: [C: 2] codfw: add syslog CNAME for codfw [dns] - https://gerrit.wikimedia.org/r/306405 (owner: Filippo Giunchedi)
[09:43:37] (PS2) Filippo Giunchedi: codfw: add syslog CNAME for codfw [dns] - https://gerrit.wikimedia.org/r/306405
[09:45:38] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Puppet has 1 failures
[09:46:53] (PS2) Jcrespo: Configuration changes regarding firewall and mysql for prometheus [puppet] - https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757)
[09:49:24] (CR) Muehlenhoff: [C: 2] Disable unprivileged user namespaces on trusty systems [puppet] - https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567) (owner: Muehlenhoff)
[09:49:28] (PS3) Muehlenhoff: Disable unprivileged user namespaces on trusty systems [puppet] - https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567)
[09:49:43] Operations, MediaWiki-General-or-Unknown, Services, Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2578479 (BBlack) My largest concern is still whether we can guarantee our application layer is insensitive to query parameter order, a...
[09:50:19] (CR) Muehlenhoff: [V: 2] Disable unprivileged user namespaces on trusty systems [puppet] - https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567) (owner: Muehlenhoff)
[09:50:48] (PS3) Gehel: WDQS caching headers [puppet] - https://gerrit.wikimedia.org/r/306163 (https://phabricator.wikimedia.org/T137238)
[09:52:59] !log disabled creation of unprivileged user namespaces on trusty systems via sysctl kernel.unprivileged_userns_clone
[09:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:53:47] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[09:56:12] Operations, scap, Scap3, User-mobrovac: Scap::server::sources is out of sync with the repositories actually present on tin/mira - https://phabricator.wikimedia.org/T143692#2578491 (Joe) a:Joe
[09:56:27] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet last ran 1 day ago
[09:56:30] (CR) Gehel: [C: 2] WDQS caching headers [puppet] - https://gerrit.wikimedia.org/r/306163 (https://phabricator.wikimedia.org/T137238) (owner: Gehel)
[09:57:59] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:59:34] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[10:00:09] (CR) Jcrespo: [C: 1] "Looks good to me. But it needs someone monitoring how long these queries take. Please ping me for deploy if no one is around for puppetswa" [puppet] - https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936) (owner: Nemo bis)
[10:00:33] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[10:03:28] (PS2) Hashar: Stop logging xff from 127.0.0.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/301339 (https://phabricator.wikimedia.org/T129982)
[10:04:51] (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/3821/db2016.codfw.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo)
[10:05:14] (PS1) Filippo Giunchedi: rsyslog: switch log owner based on distro [puppet] - https://gerrit.wikimedia.org/r/306408
[10:05:31] ^godog
[10:08:01] jynus: nice, I'll take a look!
[10:08:20] Operations, Wikimedia-Site-requests, Wikimedia-log-errors: Requests to localhost spam the 'localhost' and 'xff' log buckets - https://phabricator.wikimedia.org/T129982#2578533 (hashar) I have rebased the patch and added a few more reviewers.
[10:08:24] it may need some long explanations
[10:13:38] (CR) Nemo bis: "Thanks. Any deployer should be able to check the script logs and see how long the queries took, I think. As for me I'll check ganglia for " [puppet] - https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936) (owner: Nemo bis)
[10:13:48] !log increase recovery throttling on elasticsearch codfw to reduce rolling restart time
[10:13:49] Operations, Patch-For-Review: Disable unprivileged user namespaces on trusty kernels - https://phabricator.wikimedia.org/T142567#2578537 (MoritzMuehlenhoff) Open>Resolved This is now enabled on all trusty kernels which have backported that sysctl.
[10:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:14:08] is anyone familiar with tmpreaper (an alternative to puppet tidy). Got added by Ori, and I am looking for reviewers ( patch is https://gerrit.wikimedia.org/r/#/c/300092/ )
[10:16:34] jynus: is there a db host with the settings temporarily applied already? wanted to check the metrics
[10:16:47] no, but I can do that
[10:17:16] basically, I have reduced the p_s to only 10-unit vectors
[10:17:37] and enforced on grants nothing private can be leaked
[10:21:52] I am not sure it will work with the multi-line
[10:22:40] (CR) Filippo Giunchedi: "LGTM overall, see comments on the firewall change though. Please split those changes in a different review, it is easier to review and rol" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo)
[10:25:59] jynus: heh I can't tell from the puppet compiler because /etc/default/ is marked as 'only in new'
[10:26:42] so check db2034, godog
[10:27:35] Operations, Continuous-Integration-Infrastructure, Jenkins, Patch-For-Review, Wikimedia-Incident: Jenkins files under /var/lib/jenkins/config-history/config need to be garbage collected - https://phabricator.wikimedia.org/T126552#2578554 (hashar) p:Normal>High Would need someone famil...
[10:27:49] egrep -v '^#' metrics.txt | wc -l -> 1087
[10:29:10] checking
[10:30:18] nice, yeah that's not many more metrics
[10:30:37] Operations, Patch-For-Review: Disable unprivileged user namespaces on trusty kernels - https://phabricator.wikimedia.org/T142567#2578558 (MoritzMuehlenhoff) Resolved>Open Actually, the labvirt* trusty systems running the HWE kernel also need to be covered, reopening.
[10:30:44] Operations, Discovery, Traffic, Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2578560 (Gehel) Open>Resolved a:Gehel New caching headers deployed. Checked with chrome, cache-control headers are showing up.
[10:31:34] PROBLEM - MD RAID on ms-be1027 is CRITICAL: CRITICAL: Active: 5, Working: 5, Failed: 1, Spare: 0 [10:32:53] PROBLEM - HP RAID on ms-be1027 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor - Failed: 2I:4:1, 2I:4:2, 1I:1:3, 1I:1:4 [10:37:47] 06Operations, 10Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2578571 (10Aklapper) History: T118176, T132968. [10:38:42] (03PS1) 10Urbanecm: TEST COMMIT - WILL BE ABANDONED IN A MOMENT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306412 [10:39:14] (03CR) 10jenkins-bot: [V: 04-1] TEST COMMIT - WILL BE ABANDONED IN A MOMENT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306412 (owner: 10Urbanecm) [10:40:02] (03Abandoned) 10Urbanecm: TEST COMMIT - WILL BE ABANDONED IN A MOMENT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306412 (owner: 10Urbanecm) [10:44:03] (03CR) 10Filippo Giunchedi: [C: 032] rsyslog: switch log owner based on distro [puppet] - 10https://gerrit.wikimedia.org/r/306408 (owner: 10Filippo Giunchedi) [10:49:41] !log start installing hhvm updates in eqiad [10:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:57:40] (03Draft2) 10Paladox: Fix phabricator expanding links [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [10:57:46] (03Draft1) 10Paladox: Fix phabricator expanding links [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [10:59:05] (03CR) 10Paladox: "I did testing here http://gerrit-test.wmflabs.org/gerrit/#/c/16/ (Please look at the phabricator links and plain T1 links for example plea" [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [11:00:08] (03PS3) 10Paladox: Fix phabricator expanding links [puppet] - 
10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [11:00:25] (03CR) 10Hashar: "recheck" [debs/contenttranslation/giella-sme] - 10https://gerrit.wikimedia.org/r/294430 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [11:01:43] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [11:07:54] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:09:58] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "The character class should not be [\"#] but [\"#<] or it will still mess up pasted URLs that are already converted to HTML." [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [11:10:34] !log enable collection of mysqld metrics from prometheus2002 too [11:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:10:44] jynus: ^ collecting from both machines now [11:11:50] how does it work, do they query twice? [11:11:59] or is there some kind of proxy? [11:12:14] also, is the format easy to fill-in? 
[11:12:24] sorry, too many question :-) [11:12:28] s [11:13:25] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2578655 (10hashar) Need #operations to publish jenkins-debian-glue packages to apt.wikimedia.org T141114#2488638 [11:13:38] 06Operations, 10Continuous-Integration-Infrastructure: Upgrade jenkins-debian-glue on Jessie slaves from 0.13.0 to latest (0.17.0) - https://phabricator.wikimedia.org/T141114#2578657 (10hashar) p:05Triage>03Normal [11:15:09] (03PS3) 10Jcrespo: Configuration changes regarding firewall and mysql for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) [11:18:44] (03PS4) 10Jcrespo: Configuration changes regarding mysql exporter for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) [11:18:46] jynus: yeah each polls independently [11:19:19] good for HA, double the queries generated :-/ [11:20:07] (03PS5) 10Jcrespo: Configuration changes regarding mysql exporter for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) [11:21:12] (03PS1) 10Ladsgroup: Increase ORES threshold for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306419 (https://phabricator.wikimedia.org/T143738) [11:21:58] jynus: indeed [11:22:03] going to lunch, brb [11:23:31] (03CR) 10Jcrespo: "The putting again the default modules is on purpose, I want to know at a glance which modules are enabled. 
If I could, I would reset all t" [puppet] - 10https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [11:39:23] 06Operations, 10Beta-Cluster-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2578713 (10hashar) So that is now pending review / merge of https://gerrit.wikimedia.org/r/247587 //beta: Use Let's Encrypt cert// which is already on beta cluster. [11:43:14] (03CR) 10Paladox: "@Thiemo Mättig (WMDE)oh, that's probably why it didn't work properly with the < added in there seems to break things." [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [11:48:29] (03CR) 10Paladox: "It seems the html is ok doing it this way. adding < breaks the T1 plain comment, ie it breaks the link." [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [11:48:42] (03PS2) 10Ladsgroup: Make default ORES threshold soft (higher threshold) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306419 (https://phabricator.wikimedia.org/T143738) [12:01:43] (03CR) 10Mobrovac: [C: 031] Change-Prop: Enable file transclusion updates [puppet] - 10https://gerrit.wikimedia.org/r/306308 (owner: 10Ppchelko) [12:05:20] hi is there a problem with the wikimedia servers? [12:05:22] Our servers are currently under maintenance or experiencing a technical problem. This is probably temporary and should be fixed soon. [12:05:50] it happens very often last half hour [12:10:44] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:51] doctaxon: Does the error page have any details about the error? [12:12:42] Request from 10.68.23.58 via cp1055 cp1055, Varnish XID 2735770911
Error: 503, Service Unavailable at Wed, 24 Aug 2016 12:02:37 GMT

[12:12:52] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [12:12:54] RECOVERY - MegaRAID on es2004 is OK: OK: optimal, 1 logical, 2 physical [12:13:40] Glaisher: ^ if this is more detail, I don't know [12:14:12] It's not very helpful for me. Maybe someone with access to logs might be able to help. [12:14:39] doctaxon: Does it happen randomly or while trying to do a certain action? [12:14:48] randomly [12:15:25] On which wiki? [12:15:47] dewiki [12:16:08] maybe other too, but I don't know [12:17:27] 06Operations, 10ops-codfw, 10DBA: es2004 has a dead disk, but it is not under warranty - https://phabricator.wikimedia.org/T143220#2578791 (10jcrespo) ``` RECOVERY - MegaRAID on es2004 is OK: OK: optimal, 1 logical, 2 physical ``` [12:17:35] 06Operations, 10ops-codfw, 10DBA: es2004 has a dead disk, but it is not under warranty - https://phabricator.wikimedia.org/T143220#2578792 (10jcrespo) 05Open>03Resolved [12:18:04] jynus: if your still around^ [12:19:12] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:22:06] <_joe_> doctaxon: looking into it [12:23:22] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [12:23:50] let me know about, _joe_ [12:24:08] I see 503 on phabricator, testwikis [12:24:15] <_joe_> doctaxon: uhm this is actually strange, most of the errors I see are due to the pageviews apis [12:24:21] aside from that only 500 on pageviews [12:24:34] and the fake errors on uploads [12:25:53] doctaxon, what did you request? [12:26:02] what url? [12:26:05] enwiki? [12:27:03] (03CR) 10Thiemo Mättig (WMDE): "Adding the # is definitely correct, so +2 for this patch here." [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [12:27:04] dewiki [12:27:33] can you share the page/url that gave you the error? 
[12:29:07] !log remove old logstash indices that were not deleted after 31 days [12:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:12] url: https://de.wikipedia.org/wiki/Spezial:ApiSandbox#action=query&format=json&prop=info&titles=Mys%C5%82owice+(S%C5%82awoborze)&inprop= [12:29:26] doctaxon, thank you [12:34:11] (03CR) 10Hashar: [C: 031] "Looked at it again. The Jessie slaves have both jdk 7 and 8, as we transition java based jobs from Trusty to Jessie we would need to keep " [puppet] - 10https://gerrit.wikimedia.org/r/295880 (https://phabricator.wikimedia.org/T138506) (owner: 10Hashar) [12:34:40] but it was different URLs [12:35:34] doctaxon, do you mean the url "Spezial:ApiSandbox" really, or the api call that would do? [12:36:05] because we cannot see any error with Spezial on the url in the last hour [12:36:06] moritzm: I have looked again at the patch to get java 8 on Jessie jenkins slaves. We would need both 7 and 8 in parallel https://gerrit.wikimedia.org/r/#/c/295880/ [12:36:17] moritzm: I am not sure though how the alternative is set, seems to default to 7 [12:36:41] the api call [12:38:05] jynus: I'm running a toollabs script, it's an API Info call [12:39:05] hashar: the java versions are managed via https://wiki.debian.org/DebianAlternatives, I'm not sure if there's a puppet module to manage the preferred alternative [12:39:34] we have [12:39:42] alternatives::select or something like that in puppet [12:39:52] doctaxon, I can barely see errors on that, the only one is already tracked at: [12:39:52] otherwise I think they depend on installation order, so whatever gets installed last assumes the alternative, not sure, would need to test [12:40:03] I have checked /var/lib/dpkg/alternatives/java and on Jessie java7 is preferred over java 8 [12:40:13] so we get java 7 consistently across the fleet of CI slaves distro [12:40:30] and with that patch the Jessie slaves additionally have java 8 for those builds that 
requires it [12:40:55] https://phabricator.wikimedia.org/T141765 [12:44:16] hashar: might be ok, maybe doublecheck in labs: create a jessie instance, install openjdk-8 and check the result of "java -version" [12:44:46] it is already deployed on the ci fleet :D [12:45:04] I have checked a jessie slave that got created a couple weeks ago and it has the proper alternative set ( java 7 ) [12:45:11] jynus: sorry, I cannot help more [12:45:17] java being set to auto, alternatives takes java 7 [12:46:23] doctaxon, what I mean is there is no ongoing issue, and not reproducible one; 1 out of 1 million times it is possible to get one of those messages, you can safely ignore it and retry unless it happens all the time [12:47:03] yes I know [12:47:31] but the 1 out 1 mill. ? It is not possible to reduce it? [12:47:38] there were some small network issues earlier in the morning, though [12:47:45] but no longer ongoing [12:48:15] so if you had that more frequently before, it was that [12:48:35] it's running better now [12:48:59] doctaxon, we are working all the time to make that as close to 0 as possible :-) , the ticket I mentioned before will solve another extra percentage [12:49:16] super [12:49:29] and I know that you are doing so much [12:49:35] you're great [12:50:16] jynus: but there's nothing happening with open logging issues [12:50:41] open logging issues, what do you mean? [12:50:53] just like T142923 [12:50:54] T142923: Missing move log and merge log of the target page in dewiki - https://phabricator.wikimedia.org/T142923 [12:52:17] doctaxon, sorry, I cannot help with that, that is outside of my technical expertise, probably -tech is the right place? 
[12:52:22] the MediaWiki-Special-pages workboard is so very full [12:52:44] and all of that is open [12:52:46] hashar: sounds good to merge, then :-) [12:53:21] and that to running mediawiki software [12:54:09] what I mean is that for reporting server/network errors this is the right place [12:54:34] for software (mediawiki errors) the right people may not be here (I am not sure) [12:55:14] okay [12:55:28] See the edit on the ticket: "MediaWiki-Special-pages; removed Operations." [12:56:25] !log changeprop deploying c5fd932 [12:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:57:20] moritzm: yeah :) [12:57:32] moritzm: and later on migrate to java 8 :} [13:00:04] hashar, Dereckson, addshore, and aude: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T1300). [13:00:04] Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:12] Hey [13:00:19] :) [13:00:30] This one is easy :D [13:00:40] are you handling it or can we get zeljkof to do it ? [13:02:34] hashar: IDK. I'm not a deployer (I don't have gerrit rights though) [13:02:46] going to pair that one with zeljkof [13:02:49] is coming in a few [13:02:54] he is coming in a few [13:02:57] okay thanks [13:03:59] hashar: Amir1 o/ [13:04:08] Hey [13:04:17] :) [13:04:24] sorry, on a machine that I did not use a few days, everything wants to update :| [13:04:31] oh man ! 
:D [13:04:38] skip update and ssh to tin :} [13:04:42] A) CR+2 https://gerrit.wikimedia.org/r/#/c/306419/ [13:05:42] hashar: yeah, cancelling everything [13:05:47] then on tin.eqiad.wmnet in /srv/mediawiki-staging follow https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#In_your_own_repo_via_gerrit [13:05:54] PROBLEM - Disk space on labnet1002 is CRITICAL: DISK CRITICAL - free space: /boot 10 MB (3% inode=99%) [13:05:56] are you joining the hangout, or are we doing it here? [13:06:13] eg fetch from remote compare head with upstream: git diff HEAD..HEAD@{u} or git diff HEAD..origin/master [13:06:17] if happy: git rebase [13:06:40] the alert for labnet1002 is due to having installed a new kernel, I'll drop some of the old, unused ones there [13:07:17] then: scap sync-file wmf-config/path/to/file 'some message and Txxxx' [13:07:22] ok, I'll do the SWAT today [13:07:29] looking at https://gerrit.wikimedia.org/r/#/c/306419/ [13:07:33] lets do it here [13:07:38] Amir1 can't join hangout [13:07:50] I can SWAT today! [13:07:51] (03CR) 10Hashar: [C: 031] Make default ORES threshold soft (higher threshold) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306419 (https://phabricator.wikimedia.org/T143738) (owner: 10Ladsgroup) [13:07:53] RECOVERY - Disk space on labnet1002 is OK: DISK OK [13:07:56] (reading the instructions) [13:07:58] hangout? [13:08:04] I can do it if you need to [13:08:20] if I knew it is irc only, I could have stayed at my usual machine :) [13:08:24] nevermind [13:08:36] I dont mind joining a hangout [13:09:29] hashar, Amir1: in that case, see you in https://hangouts.google.com/hangouts/_/wikimedia.org/euswat [13:09:33] I will invite Amir1 [13:09:35] ah [13:09:39] joining that one [13:10:02] requesting to join the call [13:11:33] Amir1: can you request again on https://hangouts.google.com/hangouts/_/wikimedia.org/euswat ? 
[13:12:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:13:46] (03CR) 10Zfilipin: [C: 032] Make default ORES threshold soft (higher threshold) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306419 (https://phabricator.wikimedia.org/T143738) (owner: 10Ladsgroup) [13:13:52] (03CR) 10Zfilipin: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306419 (https://phabricator.wikimedia.org/T143738) (owner: 10Ladsgroup) [13:14:14] (03Merged) 10jenkins-bot: Make default ORES threshold soft (higher threshold) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306419 (https://phabricator.wikimedia.org/T143738) (owner: 10Ladsgroup) [13:16:02] jouncebot: next [13:16:02] In 4 hour(s) and 43 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T1800) [13:16:20] 4 hours? hmm, ok [13:17:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:17:11] (03PS1) 10Ema: Upgrade cp1048 (cache_upload) to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/306427 (https://phabricator.wikimedia.org/T131502) [13:17:32] who's swatting now? [13:17:37] mafk: hi, if you've something to deploy now, we still have forty-five minutes available [13:17:44] zeljkof seems to be [13:17:55] Dereckson: yep, the mswiki patch [13:18:00] I finally got time to do this [13:18:02] we are [13:18:06] will put on wikitech:D [13:18:15] Amir1 zeljkof and I are in hangouts https://hangouts.google.com/hangouts/_/wikimedia.org/euswat [13:18:16] 1 min please [13:18:17] if you wanna join [13:18:32] jouncebot: refresh [13:18:35] I refreshed my knowledge about deployments. [13:18:36] jouncebot: current [13:18:50] hangout for a swat? 
[13:19:24] (03PS1) 10Giuseppe Lavagetto: scap::source: use puppet to manage directory creation [puppet] - 10https://gerrit.wikimedia.org/r/306429 [13:19:26] (03PS1) 10Giuseppe Lavagetto: scap::source: allow picking phabricator as a source. [puppet] - 10https://gerrit.wikimedia.org/r/306430 [13:19:28] (03PS1) 10Giuseppe Lavagetto: scap::source: also define the corresponding dsh group [puppet] - 10https://gerrit.wikimedia.org/r/306431 [13:19:32] hashar: problem is that it ask me for micro and camera access, and I don't want that ;) [13:19:33] yeah for training purposes / screen sharing [13:19:43] mafk: you can just lurk :} [13:20:02] paravoid: intent is to get more swat deployers [13:21:32] (03PS7) 10MarcoAurelio: Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T141227) [13:22:21] (03PS2) 10Ema: Upgrade cp4005 (ulsfo cache_upload) to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/306427 (https://phabricator.wikimedia.org/T131502) [13:22:34] hashar: on wikitech:Deployements now [13:22:42] waiting for rebase in progress [13:23:01] (done) [13:25:01] zeljkof: can you do a SWAT for a patch of mine? 
[13:27:20] mafk: sure [13:27:38] (03CR) 10BBlack: [C: 031] Upgrade cp4005 (ulsfo cache_upload) to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/306427 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [13:27:42] I listed it in Wikitech:Deployements [13:27:46] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: Make default ORES threshold soft (higher threshold) T143738 (duration: 00m 59s) [13:27:47] T143738: Edits being flagged by review tool on enwiki aren't likely to be damaging - https://phabricator.wikimedia.org/T143738 [13:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:53] sorry for being late [13:29:17] (03CR) 10Filippo Giunchedi: [C: 031] Configuration changes regarding mysql exporter for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [13:29:37] mafk: as soon as we finish checking the current deployment I will ping you, please wait, your patch is important to us :) [13:29:56] zeljkof: no problems [13:30:02] I'll be around [13:30:17] * mafk pings Glaisher [13:32:12] Thanks zeljkof and hashar [13:32:31] Amir1: if you need a patch for ORES we can do that before or after the train this evening [13:32:36] or use SWAT later today or tomorrow [13:33:06] [config] 305436 Fully disable uploads on mswiki (task T141227) ? [13:33:07] T141227: Disable local uploads on ms.wikipedia.org (Malay Wikipedia) - https://phabricator.wikimedia.org/T141227 [13:33:14] mafk: looking at your patch [13:33:19] Okay. 
[13:34:01] hashar: that's my patch [13:34:07] (03CR) 10Hashar: [C: 032] Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [13:34:15] mafk: processing with your patch [13:34:27] (03CR) 10Zfilipin: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [13:34:42] (03Merged) 10jenkins-bot: Fully restrict uploads on ms.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305436 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [13:34:46] it seems it's getting quite some attention heh [13:36:08] mafk: https://ms.wikipedia.org/ has Upload links already pointing to uploadwizard [13:36:42] hashar: yep, because the wiki is on commonsuploads.dblist iirc [13:36:48] k [13:36:51] I removed the wiki from there [13:36:57] and Special:Upload says perm error [13:36:59] and disabled local uploads [13:37:05] zeljkof is going to push it to mw1099 [13:37:06] even from sysops [13:37:17] yep, only sysops were supposed to be able to upload locally [13:37:23] now we remove this entirely [13:37:40] x-wikimedia-debug activated, waiting for your signal [13:37:50] Tch, we give the sysop flag to just about anyone who asks anyway ;) [13:38:03] (yay more UW users!) [13:38:20] are canary servers configured with more logging? [13:38:36] mafk: it is on mw1099 now :} [13:38:38] * mafk would find useful sysop on wikitech marktraceur, just saying 9_9 [13:38:50] Would that I could, mafk [13:39:02] Special:Upload now yields "File uploads are disabled." [13:39:03] looks fine to me ? 
[13:39:15] I hardly ever log in to wikitech, fat lot of good it would do me to have sysop (much less 'crat) there [13:39:23] !log upgrading cp4005 to Varnish 4 (T131502) [13:39:24] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [13:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:39:50] (03CR) 10Nemo bis: "Sorry jcrespo, didn't see the second line. I think the most reliable way to see what queries are run is to dig logs for queries (on other " [puppet] - 10https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936) (owner: 10Nemo bis) [13:39:53] hashar: special:listgrouprights@mswiki on mw1099 still has upload permissions to sysops [13:39:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:40:00] is it correctly on mw1099? [13:40:07] yup [13:40:15] maybe the patch need some more tweaks? [13:40:50] Doesn't look good to me. 
[13:40:53] mafk: the permission would have to be dropped from sysops [13:41:04] then if upload is disabled, I am not sure whether sysops will be able to upload [13:41:15] will want to look at the user rights in our settings.php file [13:41:39] (03CR) 10Ema: [C: 032] Upgrade cp4005 (ulsfo cache_upload) to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/306427 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [13:41:46] I have to look at other wikis that have uploads disabled [13:41:50] ie eswiki [13:41:55] not on commonsuploads.dblist [13:42:13] and with $wgEnableUploads = [ 'false' ] [13:42:21] well with wgEnableUploads = false [13:42:35] I would say that a user having 'upload' permissions would not be able to upload anyway [13:42:36] q [13:42:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [13:42:48] let me check [13:43:15] eswiki is in that case https://es.wikipedia.org/wiki/Especial:ListaDerechosGrupos [13:43:19] that lists 'upload' [13:43:28] yep [13:43:32] I've just checked [13:43:56] I'd prefer however not to merge at this stage and would like to consult [13:44:11] it is all fine [13:44:11] it looks weird that those upload rights ain't removed [13:44:12] pushing that [13:44:21] we can look at polishing up the permissions later on [13:44:24] okay then [13:45:00] or create an uploadsdisabled.dblist with */user/autoconfirmed/confirmed/sysop set to uploads = false [13:45:11] hashar: I found the issue, I'm making the patch. Can we push it in this window? 
[13:45:16] yeah [13:45:24] if it tested :-D [13:45:34] probably want to exercise it on beta cluster first [13:45:46] then we can sneak deploy it later on [13:46:02] Amir1: I am in a meeting in 15 minutes so can t really baby sit it [13:46:52] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [13:47:05] I can test after merging [13:47:31] rights in my globalgroup heh [13:47:52] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:03] !log zfilipin@tin Synchronized dblists/commonsuploads.dblist: Fully restrict uploads on ms.wikipedia T141227 (duration: 00m 46s) [13:48:05] T141227: Disable local uploads on ms.wikipedia.org (Malay Wikipedia) - https://phabricator.wikimedia.org/T141227 [13:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:49:03] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: Fully restrict uploads on ms.wikipedia T141227 (duration: 00m 46s) [13:49:05] T141227: Disable local uploads on ms.wikipedia.org (Malay Wikipedia) - https://phabricator.wikimedia.org/T141227 [13:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:49:18] mafk: deployed! [13:49:22] please check [13:49:37] testing [13:49:54] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [13:50:27] "File uploads are disabled." 
@ Special:Upload at ms.wikipedia [13:50:29] LGTM [13:50:34] great :} [13:50:43] !log European SWAT done [13:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:51] jouncebot: next [13:50:51] In 4 hour(s) and 9 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T1800) [13:50:57] we are sticking around :) [13:51:06] Amir1: we can get your ORES patch four hours from now [13:51:13] maybe afterlunch swat :P [13:51:14] can get it tested on beta cluster [13:51:23] :D [13:51:24] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 2 failures [13:51:29] Awesome [13:51:38] feel free to add it to the other swat window, and poke someone to baby sit then if you can't attend [13:51:41] Lunch for whom ;) [13:52:02] Amir1: I am doing the mw train deployment just after that swat window [13:52:50] okay [13:53:17] !log rolling restart of logstash nodes to validate fix to T142357 [13:53:18] T142357: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357 [13:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:23] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:54:24] zeljkof: and I commented on Amir1 / ORES task https://phabricator.wikimedia.org/T143738#2578985 about the SWAT outcome [13:54:34] as to keep an audit trail for the next patch [13:54:50] so if something is done in the next swat window and we are not around, the deployer has the context / history [13:54:56] okay [13:55:02] Awesome [13:55:02] without having to dig in the irc log / !log whatever [13:55:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps 
wave]BR [13:55:23] The thing is that I fixed the wrong the global variable :D [13:55:36] that global variable is obsolete [13:55:39] ;-} [13:56:06] it should be $wgDefaultUserOptions["oresDamagingPref"] [13:56:50] and I deprecated that variable, I have memory of a gold fish [13:56:56] (03CR) 10Hashar: "Time is CEST:" [puppet] - 10https://gerrit.wikimedia.org/r/295880 (https://phabricator.wikimedia.org/T138506) (owner: 10Hashar) [13:57:26] no worries [13:58:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:59:12] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 34 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 34, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number [13:59:13] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 34 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 34, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number [13:59:24] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 34 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 34, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number [13:59:24] PROBLEM - ElasticSearch health check for shards on logstash1003 is 
CRITICAL: CRITICAL - elasticsearch inactive shards 34 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 34, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number [13:59:52] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 34 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 34, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number [14:00:14] ^ apologies for the spam, logstash restart not going as well as planned [14:00:53] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 14 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 10, number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 2382, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_num [14:01:32] RECOVERY - ElasticSearch health check for shards on logstash1006 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 6, number_of_pending_tasks: 11, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 34741, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1960784314, [14:01:33] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 6, 
number_of_pending_tasks: 11, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 39802, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1960784314, [14:01:42] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 44036, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 91.1764705882, [14:01:42] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 44098, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 91.1764705882, [14:02:03] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 98.0392156863, acti [14:02:03] ACKNOWLEDGEMENT - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 14 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 10, number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 2382, cluster_name: 
production-logstash-eqiad, relocating_shards: 0, active_shards_percen [14:03:02] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 34, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 99.0196078431, acti [14:03:21] Platonides: ? [14:03:54] Platonides: mind explaining the kickban? [14:04:28] maybe an over-aggressive automated script in his client? [14:04:32] (03PS1) 10Ladsgroup: Change ORES threshold to soft as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306435 (https://phabricator.wikimedia.org/T143738) [14:04:40] probably [14:05:20] I mentioned issues with open proxies, maybe he automated that [14:05:31] hashar: https://gerrit.wikimedia.org/r/#/c/306435/1 [14:05:49] aye that's possible indeed [14:05:57] I can make a patch for beta [14:06:07] godog, I think we cannot do a proper dashboard on grafana [14:06:08] whatever it is it isn't working very well I think though [14:06:38] jynus: ah, what's missing? [14:06:43] I can do tables by time [14:06:47] (03PS1) 10ArielGlenn: handle prereq properly when a previous dump task has failed rather than not run [dumps] - 10https://gerrit.wikimedia.org/r/306436 [14:06:50] by metric [14:06:54] etc. [14:07:15] but it is not possible to have several "latest metrics" for several servers [14:07:20] not in the same panel [14:08:01] Basically I want something like "Host IPv4 Release RAM Up Act.
QPS Rep Lag Tree" [14:08:21] a quick look summary, but without aggregation [14:08:55] ah, yeah I haven't tried that yet, trying now [14:09:01] I checked http://play.grafana.org/dashboard/db/table-panel-showcase [14:09:05] (03CR) 10ArielGlenn: [C: 032] handle prereq properly when a previous dump task has failed rather than not run [dumps] - 10https://gerrit.wikimedia.org/r/306436 (owner: 10ArielGlenn) [14:09:17] you can have different aggregations, but always from the same metric [14:10:09] this is different from the use case of "aggregating all nodes" [14:10:15] I can do that [14:11:17] maybe something with repeating template members? [14:17:17] mhh repeating on what variable? [14:17:58] (03PS2) 10Giuseppe Lavagetto: scap::source: use puppet to manage directory creation [puppet] - 10https://gerrit.wikimedia.org/r/306429 [14:18:00] (03PS2) 10Giuseppe Lavagetto: scap::source: allow picking phabricator as a source. [puppet] - 10https://gerrit.wikimedia.org/r/306430 [14:18:03] (03PS2) 10Giuseppe Lavagetto: scap::source: also define the corresponding dsh group [puppet] - 10https://gerrit.wikimedia.org/r/306431 [14:18:04] (03PS1) 10Giuseppe Lavagetto: role::deployment::server: fix scap3/trebuchet declarations [puppet] - 10https://gerrit.wikimedia.org/r/306440 (https://phabricator.wikimedia.org/T143692) [14:18:13] repeating the panel, but with a different instance (a selector with all selected) [14:18:15] <_joe_> thcipriani|afk: whenever you're around, these are up for you ^^ [14:18:40] I do not know, just guessing [14:18:49] _joe_: thank you! Will review [14:20:38] jynus: could be! let me know if you discover how to do it, we could also ask the grafana/prometheus people [14:20:57] you understood well what I mean, right?
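The per-host summary table discussed above (one row per server, one column per metric, latest value only, no aggregation across hosts) can be sketched outside Grafana. This is only a minimal illustration of the collation step, assuming samples arrive as (metric, host, timestamp, value) tuples; the metric names, host names and the helper functions `latest_per_host` and `render` are hypothetical and are not part of Grafana or Prometheus.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# A sample is (metric name, host, unix timestamp, value).
# All metric and host names used with this sketch are hypothetical.
Sample = Tuple[str, str, int, float]


def latest_per_host(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """Keep only the newest value of each metric for each host."""
    newest: Dict[Tuple[str, str], Tuple[int, float]] = {}
    for metric, host, ts, value in samples:
        key = (host, metric)
        if key not in newest or ts > newest[key][0]:
            newest[key] = (ts, value)
    table: Dict[str, Dict[str, float]] = {}
    for (host, metric), (_, value) in newest.items():
        table.setdefault(host, {})[metric] = value
    return table


def render(table: Dict[str, Dict[str, float]], columns: Sequence[str]) -> List[str]:
    """One row per host, one column per metric; '-' marks a missing value."""
    header = ["Host"] + list(columns)
    rows = [header]
    for host in sorted(table):
        cells = [format(table[host][c], "g") if c in table[host] else "-" for c in columns]
        rows.append([host] + cells)
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return ["  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip() for row in rows]


if __name__ == "__main__":
    demo = [
        ("qps", "db1001", 100, 250.0),
        ("qps", "db1001", 160, 300.0),
        ("rep_lag", "db1001", 160, 0.5),
        ("qps", "db1002", 150, 120.0),
    ]
    print("\n".join(render(latest_per_host(demo), ["qps", "rep_lag"])))
```

The point of the sketch is that each column comes from a different metric and the rows are joined on the host label, which is the part the chat says a single table panel could not do at the time.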
[14:21:34] yeah you basically want a table with an "audit" of all servers matching some criteria [14:21:51] yes, with different metrics on columns [14:21:51] but the values come from different metrics and they need to be collated into the table [14:28:10] (03PS1) 10Filippo Giunchedi: [WIP] base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 [14:31:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [14:31:50] (03PS2) 10Filippo Giunchedi: [WIP] base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 [14:32:00] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/giella-sme] - 10https://gerrit.wikimedia.org/r/294430 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [14:33:29] (03PS1) 10MarcoAurelio: Remove upload rights on wikis where local uploads are disabled. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306443 (https://phabricator.wikimedia.org/T143789) [14:33:35] (03PS2) 1020after4: WIP: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [14:34:00] (03CR) 10jenkins-bot: [V: 04-1] Remove upload rights on wikis where local uploads are disabled. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/306443 (https://phabricator.wikimedia.org/T143789) (owner: 10MarcoAurelio) [14:37:07] (03PS1) 10Aaron Schulz: Fix getWikiVersion() error message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306444 [14:37:24] (03PS3) 10Filippo Giunchedi: [WIP] base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 [14:39:23] (03CR) 10Chad: [C: 032] Fix getWikiVersion() error message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306444 (owner: 10Aaron Schulz) [14:39:51] (03Merged) 10jenkins-bot: Fix getWikiVersion() error message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306444 (owner: 10Aaron Schulz) [14:41:34] !log demon@tin Synchronized multiversion/getMWVersion.php: fixup error message (duration: 00m 47s) [14:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:54] 06Operations, 10ops-codfw, 06Discovery: codfw: rack/setup/deploy wdqs200[12]switch configuration - https://phabricator.wikimedia.org/T143613#2579089 (10Papaul) @akosiaris can you please check the switch port again for me please. I tried to install both servers yesterday i couldn't get to the DHCP server. T... [14:43:57] (03CR) 10BryanDavis: [C: 031] "LGTM. If someone suddenly notices that they need this it will be easy enough to re-enable." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301339 (https://phabricator.wikimedia.org/T129982) (owner: 10Hashar) [14:45:37] (03PS4) 10Filippo Giunchedi: [WIP] base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 [14:45:43] bd808: thx :) [14:47:11] hashar: JFDI ;) [14:47:36] I guess :} [14:49:10] I am xiaocao2015 that talked with Shanmugamp7 on #wikimedia-operations [14:49:18] Anyone here????? [14:49:42] My account [[User:平天下的小曹2015]] was be hacked iin 2016.8.20. My all accounts in every websites was be hacked. I changed all password at first time. 
But it's too late.When I changing my password of wikipedia,I know my email address also modified,too. [14:49:51] My account [[User:平天下的小曹2015]] was be hacked in 2016.8.20. My all accounts in every websites was be hacked. I changed all password at first time. But it's too late.When I changing my password of wikipedia,I know my email address also modified,too. [14:50:05] Anyone here????? [14:50:43] Anyone here????? [14:50:54] Can you see me? [14:51:52] <_joe_> xiaocao2015_: yes, this is an async medium; multiple people probably read you already. [14:52:26] ok [14:52:33] How to recover my account? [14:52:56] (03PS2) 10MarcoAurelio: Remove upload rights on wikis where local uploads are disabled. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306443 (https://phabricator.wikimedia.org/T143789) [14:53:17] What??? [14:53:52] (03PS3) 10MarcoAurelio: Remove upload rights on wikis where local uploads are disabled. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306443 (https://phabricator.wikimedia.org/T143789) [14:54:18] I can't understand. [14:54:33] I am not good at English. I am Chinese. [14:54:46] It is a question of my account? [14:55:34] <_joe_> xiaocao2015_: no, don't worry [14:55:57] <_joe_> xiaocao2015_: I have no idea what you should do now, but I am asking [14:56:22] Help me to recover that please. [14:56:30] The f**king hacker. [14:56:30] (03PS5) 10Filippo Giunchedi: [WIP] base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 [14:56:45] My bilibili account has be hacked yesterday,too. [14:57:23] xiaocao2015_: we are looking into it, please allow for a few minutes of delay... [14:59:22] OK. [14:59:27] Thank you very much.
[15:00:11] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy wezen (codfw syslog) - https://phabricator.wikimedia.org/T143146#2579100 (10fgiunchedi) 05Open>03Resolved host in puppet, resolving [15:01:04] (03PS6) 10Filippo Giunchedi: base: support for multiple syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/306442 (https://phabricator.wikimedia.org/T138073) [15:03:02] (03CR) 10Filippo Giunchedi: "puppet compiler run: https://puppet-compiler.wmflabs.org/3828/" [puppet] - 10https://gerrit.wikimedia.org/r/306442 (https://phabricator.wikimedia.org/T138073) (owner: 10Filippo Giunchedi) [15:03:18] 06Operations, 10ops-codfw, 06Discovery: rack/setup/deploy wdqs200[12] - https://phabricator.wikimedia.org/T142864#2579115 (10Papaul) [15:03:49] 06Operations, 13Patch-For-Review: setup syslog server in codfw - https://phabricator.wikimedia.org/T138073#2579117 (10fgiunchedi) a:03fgiunchedi [15:07:31] <_joe_> BanBot? [15:07:35] <_joe_> seriously? [15:09:25] _joe_: seems so, but it's useful [15:09:58] <_joe_> mafk: I am very uncomfortable with it and I guess we should've discussed introducing it [15:10:11] _joe_: don't blame me, I'm not its owner [15:10:17] <_joe_> mafk: I'm not :) [15:10:21] nor did participate on its creation, etc. [15:10:32] Platonides did it because of the recurrent troll [15:10:43] :) [15:10:54] <_joe_> I'm not blaming anyone either, I'm just saying i'd like to discuss that, and make it a shared decision to add a banbot here [15:11:33] mark: paravoid: Alex is not around can someone please check this for me i am blocked https://phabricator.wikimedia.org/T143613 [15:12:22] didn't anyone say this week that installs in codfw didn't work? [15:14:50] mark: i just did an install this week for wezen the new syslog server [15:14:53] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:16:47] My account is OK?
[15:16:52] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [15:18:18] papaul: both should work now [15:18:30] mark: thanks [15:19:38] (03PS1) 10Ema: cache_upload varnishtest: pass Range requests [puppet] - 10https://gerrit.wikimedia.org/r/306448 (https://phabricator.wikimedia.org/T142233) [15:19:42] <_joe_> xiaocao2015_: did you contact OTRS already? [15:19:57] OK [15:20:06] I sent Email [15:20:10] But no respond. [15:20:14] <_joe_> if so, it's already being handled [15:20:15] 3 days ago [15:20:33] <_joe_> they'll respond when they have an answer [15:20:37] 06Operations, 10ops-codfw, 06Discovery: codfw: rack/setup/deploy wdqs200[12]switch configuration - https://phabricator.wikimedia.org/T143613#2579179 (10mark) Configuration fixed on both switch stacks. [15:21:00] (03CR) 10Ema: [C: 032] cache_upload varnishtest: pass Range requests [puppet] - 10https://gerrit.wikimedia.org/r/306448 (https://phabricator.wikimedia.org/T142233) (owner: 10Ema) [15:21:43] ACKNOWLEDGEMENT - HP RAID on ms-be1027 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor - Failed: 2I:4:1, 2I:4:2, 1I:1:3, 1I:1:4 Filippo Giunchedi diagnosing disk errors, T140374 [15:21:44] ACKNOWLEDGEMENT - MD RAID on ms-be1027 is CRITICAL: CRITICAL: Active: 5, Working: 5, Failed: 1, Spare: 0 Filippo Giunchedi diagnosing disk errors, T140374 [15:22:24] !log installing wdqs200[1-2] [15:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:22] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2579187 (10fgiunchedi) or maybe not! just today there are four (!) faults reported ``` => ld all show Smart Array P840 in Slot 3 array A logicaldrive 1 (186.3 GB, RAID 0, Fail... 
[15:26:50] <_joe_> !log attempting to restart apache on rhodium, swapping, load exploding [15:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:00] 06Operations, 10ops-codfw: rack/setup/deploy puppetmaster200[12] - https://phabricator.wikimedia.org/T143255#2579199 (10Papaul) I chat with @Joe we are going to keep the hostnames puppetmaster2001 and puppetmaster2002 [15:27:42] PROBLEM - DPKG on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:27:43] PROBLEM - SSH on rhodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:13] PROBLEM - MD RAID on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:34] RECOVERY - DPKG on rhodium is OK: All packages OK [15:29:42] RECOVERY - SSH on rhodium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [15:30:04] RECOVERY - MD RAID on rhodium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:30:07] <_joe_> expect a shower of puppet failures [15:30:33] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: puppet fail [15:30:55] (03CR) 10Mobrovac: [C: 04-1] role::deployment::server: fix scap3/trebuchet declarations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306440 (https://phabricator.wikimedia.org/T143692) (owner: 10Giuseppe Lavagetto) [15:31:04] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 1 failures [15:31:33] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: puppet fail [15:31:52] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: puppet fail [15:32:23] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: puppet fail [15:32:42] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: puppet fail [15:32:42] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: puppet fail [15:32:42] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: puppet fail [15:32:42] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: 
puppet fail [15:32:43] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: puppet fail [15:32:43] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: puppet fail [15:32:43] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: puppet fail [15:32:43] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: puppet fail [15:32:44] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: puppet fail [15:32:45] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: puppet fail [15:32:52] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: puppet fail [15:32:53] PROBLEM - puppet last run on elastic2024 is CRITICAL: CRITICAL: puppet fail [15:32:53] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: puppet fail [15:32:53] PROBLEM - puppet last run on db1089 is CRITICAL: CRITICAL: puppet fail [15:32:53] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: puppet fail [15:32:53] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: puppet fail [15:32:54] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: puppet fail [15:32:54] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: puppet fail [15:32:54] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: puppet fail [15:32:54] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: puppet fail [15:32:55] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: puppet fail [15:32:56] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: puppet fail [15:32:56] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: puppet fail [15:32:57] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: puppet fail [15:33:02] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: puppet fail [15:33:02] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Puppet has 6 failures [15:33:02] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: puppet fail [15:33:03] PROBLEM - puppet last run on cp1068 is 
CRITICAL: CRITICAL: puppet fail [15:33:03] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: puppet fail [15:33:03] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: puppet fail [15:33:20] <_joe_> ok I was about to do that [15:34:31] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Puppet has 4 failures [15:34:32] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 37 failures [15:34:32] PROBLEM - puppet last run on mw2072 is CRITICAL: CRITICAL: puppet fail [15:34:32] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail [15:35:05] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2579210 (10Cmjohnson) @faidon Would you be okay with 4TB disks instead of the 6TB disks we have now or would you want to go w/ SW raid? [15:35:07] 06Operations, 10ops-codfw: rack/setup/deploy puppetmaster200[12] - https://phabricator.wikimedia.org/T143255#2579211 (10Joe) Partitioning scheme should be `raid1-lvm-ext4.cfg` [15:35:30] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: Puppet has 26 failures [15:35:30] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Puppet has 27 failures [15:35:30] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 25 failures [15:35:30] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 26 failures [15:35:30] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 26 failures [15:35:32] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 39 failures [15:35:33] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 29 failures [15:35:43] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 26 failures [15:35:44] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 12 failures [15:35:52] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 26 failures 
[15:35:53] PROBLEM - puppet last run on mw2062 is CRITICAL: CRITICAL: Puppet has 34 failures [15:35:53] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 34 failures [15:36:02] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 26 failures [15:36:02] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Puppet has 39 failures [15:36:32] what does these problems mean? [15:36:51] _joe_: ^ [15:37:41] <_joe_> doctaxon: that the puppetmaster (the server for our configuration management software, puppet) [15:37:44] you should voice it to avoid some Excess Flood [15:37:46] <_joe_> had a failure [15:38:00] <_joe_> so all the hosts fail running puppet [15:38:07] <_joe_> it has no user-facing impact [15:38:07] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2579221 (10Cmjohnson) I see the 4 failed disks (amber lights) on the sever....I am finding it hard to believe that this server was shipped w/ so many bad disks. Has to be something else. [15:38:30] <_joe_> if you look at the SAL, I restarted apache there and added "expect a flood of puppet failures" [15:38:47] <_joe_> to see what was happening, look here [15:38:49] <_joe_> https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=rhodium.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS [15:38:52] _joe_: has it impact to grid [15:38:59] ? [15:39:18] <_joe_> gridengine? 
no that's not even part of the network that uses rhodium as a puppetmaster [15:39:26] okay [15:40:45] (03PS3) 10Dzahn: contint: Java 8 on Jessie slaves [puppet] - 10https://gerrit.wikimedia.org/r/295880 (https://phabricator.wikimedia.org/T138506) (owner: 10Hashar) [15:41:11] !log ema@palladium conftool action : set/pooled=yes; selector: cp4005.ulsfo.wmnet (tags: ['dc=ulsfo', 'cluster=cache_upload', 'service=nginx']) [15:41:12] !log ema@palladium conftool action : set/pooled=yes; selector: cp4005.ulsfo.wmnet (tags: ['dc=ulsfo', 'cluster=cache_upload', 'service=varnish-fe']) [15:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:14] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:48:13] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:48:43] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:49:12] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:24] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:24] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:49:33] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:49:42] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:49:43] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:49:44] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:49:54] (03PS1) 10Cmjohnson: Removing 
dns entries for decom servers mw1090-1096 [dns] - 10https://gerrit.wikimedia.org/r/306452 [15:50:03] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:50:13] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:50:33] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:34] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:44] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:03] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:51:04] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:51:04] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:51:04] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom servers mw1090-1096 [dns] - 10https://gerrit.wikimedia.org/r/306452 (owner: 10Cmjohnson) [15:51:12] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:22] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:51:23] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:51:33] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:51:39] 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2579264 (10Jhernandez) 
AFAIK I don't think that'd be a problem. Maybe @dr0ptp4kt has more insight. [15:51:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [15:51:53] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:52:13] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:14] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:23] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:52:32] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:32] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:34] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:52:42] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:53] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:52:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:53:02] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:53:02] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:14] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 55 seconds 
ago with 0 failures [15:53:14] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:14] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:53:19] (03CR) 10Chad: "Is this still needed? If not please abandon." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249402 (https://phabricator.wikimedia.org/T94029) (owner: 10Matthias Mullie) [15:53:22] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:53:23] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:53:24] RECOVERY - puppet last run on mw1270 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:53:27] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2579269 (10Dzahn) >>! In T143138#2561115, @Mjohnson_WMF wrote: > @MZMcBride, I've asked Katy Love about https://grants.wikimedia.org/.... 
[15:53:33] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:33] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:53:33] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:53:34] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:42] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:53:43] RECOVERY - puppet last run on mc1018 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:53:52] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:53:52] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:53:52] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:53] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:53:54] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:54] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:53:54] RECOVERY - puppet last run on ms-be1003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:54:02] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:02] RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:54:02] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is 
currently enabled, last run 1 second ago with 0 failures [15:54:02] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:54:03] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:54:12] (03CR) 10Chad: "Is this still needed? Seems harmless either way..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: 10Legoktm) [15:54:12] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:54:12] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:54:13] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:54:13] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:13] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:54:13] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:13] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:14] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:14] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:54:23] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:54:23] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:54:23] RECOVERY - puppet last run on labservices1001 is OK: 
OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:54:33] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:54:33] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:39] (03Abandoned) 10Dzahn: realm: add 'projectcom' to private wiki list [puppet] - 10https://gerrit.wikimedia.org/r/305095 (https://phabricator.wikimedia.org/T143138) (owner: 10Dzahn) [15:54:43] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:54:43] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:43] RECOVERY - puppet last run on ganeti2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:43] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:54:43] RECOVERY - puppet last run on es2014 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:54:43] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:54:43] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:54:43] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:43] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:45] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:45] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:54:46] (03CR) 10Chad: "Is this needed still?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [15:54:46] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:03] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:55:03] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:55:03] RECOVERY - puppet last run on elastic2014 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:55:05] (03Abandoned) 10Dzahn: add projectcom.wikimedia.org for new private wiki [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) (owner: 10Dzahn) [15:55:12] RECOVERY - puppet last run on elastic2006 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:55:13] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:13] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:55:13] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:55:14] RECOVERY - puppet last run on graphite1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:14] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:14] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:23] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:55:24] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[15:55:32] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:55:33] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:55:33] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:33] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:33] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:34] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:55:42] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:55:42] RECOVERY - puppet last run on labsdb1008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:55:42] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:42] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:55:42] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:43] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:43] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:55:43] RECOVERY - puppet last run on mc1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:55:43] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:44] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is 
currently enabled, last run 44 seconds ago with 0 failures [15:55:45] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:46] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:52] What????????????? [15:55:54] WTF [15:55:58] It spamming [15:56:02] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:02] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:56:02] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:02] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:12] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:12] (03CR) 10Giuseppe Lavagetto: [C: 031] Provide override file for base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/305635 (owner: 10Muehlenhoff) [15:56:12] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:56:13] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:13] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:56:14] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:21] (03CR) 10Chad: "This ok to move forward?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/237686 (owner: 10Legoktm) [15:56:22] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:56:23] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:23] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:23] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:56:23] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:23] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:56:24] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:24] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:56:24] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:56:24] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:56:25] (03CR) 10Dzahn: [C: 032] contint: Java 8 on Jessie slaves [puppet] - 10https://gerrit.wikimedia.org/r/295880 (https://phabricator.wikimedia.org/T138506) (owner: 10Hashar) [15:56:25] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:32] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:56:32] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:56:32] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently 
enabled, last run 49 seconds ago with 0 failures [15:56:32] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:33] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:56:33] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:33] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:56:34] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:56:34] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:38] (03CR) 10Chad: "Did we do CategoryTree yet? If not can we do it here?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268627 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [15:56:42] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:56:43] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:43] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:44] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:52] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:52] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:56:52] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:56:53] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [15:56:53] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:53] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:54] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:54] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:54] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:56:54] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:56:54] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:57:03] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:57:04] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:04] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:04] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:57:04] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:04] RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:05] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:05] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:57:05] RECOVERY - puppet 
last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:12] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:57:13] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:57:13] RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:57:22] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:23] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:57:23] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:57:23] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:57:24] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:24] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:57:26] (03CR) 10Chad: "Fine by me, shall we merge?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) (owner: 10Mdann52) [15:57:32] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:33] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:57:33] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:57:42] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:57:43] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:57:43] RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:57:44] (03CR) 10Chad: "Still needed for testing?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270146 (https://phabricator.wikimedia.org/T126628) (owner: 10Dereckson) [15:57:52] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:53] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:57:53] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:54] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:54] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:58:04] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:04] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:58:05] RECOVERY - puppet 
last run on cp3036 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:58:05] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:05] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:05] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:11] !log !log wdqs200[1-2] - signing puppet certs, salt-key, initial run [15:58:13] RECOVERY - puppet last run on db1089 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:58:13] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:58:13] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:13] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:13] RECOVERY - puppet last run on elastic2024 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:58:13] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:22] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:23] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:58:23] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:23] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:24] RECOVERY - puppet last run on ms-be2021 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:32] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:58:33] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:58:33] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:58:33] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:58:33] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:58:33] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:33] RECOVERY - puppet last run on db2062 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:58:34] RECOVERY - puppet last run on lvs1009 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:58:40] (03CR) 10Chad: [C: 04-1] "Inline nitpick, otherwise lgtm assuming this is still needed." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273376 (https://phabricator.wikimedia.org/T114700) (owner: 10BryanDavis) [15:58:42] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:58:43] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:58:43] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:58:43] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:43] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:58:43] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:58:44] RECOVERY - puppet last run on maps2004 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:58:44] RECOVERY - puppet last run on restbase2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:45] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:58:45] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:58:52] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:58:53] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:58:53] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:58:53] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:02] RECOVERY - puppet last run on 
analytics1038 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:59:02] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:59:02] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:03] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:03] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:03] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:03] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:59:12] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:59:12] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:59:12] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:12] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:59:13] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:13] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:13] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:24] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:59:25] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago 
with 0 failures [15:59:25] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:25] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:59:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:59:32] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:59:32] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:59:32] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:59:33] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:59:33] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:33] RECOVERY - puppet last run on planet2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:33] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:44] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:59:52] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:59:53] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:59:55] (03CR) 10Chad: "We still want this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/276518 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [16:00:02] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:03] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:03] RECOVERY - puppet last run on suhail is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:03] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:00:03] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:03] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:00:03] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:04] RECOVERY - puppet last run on ununpentium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:12] RECOVERY - puppet last run on db1091 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:12] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:13] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:14] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:14] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:00:23] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:24] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, 
last run 37 seconds ago with 0 failures [16:00:24] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:00:24] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:24] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:27] (03PS1) 10Brian Wolff: Record content security policy events in log stash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306454 [16:00:33] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:00:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [16:00:34] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:37] (03CR) 10Dzahn: "so let's keep stderr and throw away stdout. where should it log to though?" [puppet] - 10https://gerrit.wikimedia.org/r/298785 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [16:00:38] (03CR) 10Chad: "Other than the inline complaint already listed, is this still wanted/needed?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/277585 (owner: 10Ori.livneh) [16:00:42] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:00:43] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:53] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:53] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:53] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:00:53] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:04] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:04] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:01:04] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:04] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:08] 06Operations, 10ops-eqiad, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint: Improve balance of nodes across rows for elasticsearch cluster eqiad - https://phabricator.wikimedia.org/T143685#2579296 (10Cmjohnson) I have space to add 3 elasticsearch servers to each of these racks A6, B6 and C5 Plea... 
[16:01:13] RECOVERY - puppet last run on mw1261 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:18] (03PS6) 10Jcrespo: Configuration changes regarding mysql exporter for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) [16:01:21] (03CR) 10Chad: "Fine by me if someone's around to help shepherd the change into production including sites* table fixes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277519 (https://phabricator.wikimedia.org/T126146) (owner: 10Nikerabbit) [16:01:22] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:24] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:01:24] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:24] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:01:32] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:32] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:01:33] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:33] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:01:33] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:34] RECOVERY - puppet last run on mc2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:34] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:42] RECOVERY - puppet last 
run on mw2181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:52] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:01:52] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:53] RECOVERY - puppet last run on sca2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:01:54] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:02] (03CR) 10Dzahn: "want IPv6 on the puppetmaster? It already has the IP on the interface, just no AAAA record" [dns] - 10https://gerrit.wikimedia.org/r/302624 (owner: 10Dzahn) [16:02:04] RECOVERY - puppet last run on mw2072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:14] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:24] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:02:26] (03CR) 10Andrew Bogott: "commonsettings-labs is the beta cluster, right? In which case I have no opinion :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276518 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [16:02:57] (03CR) 10Chad: "We should move this forward..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [16:03:14] (03CR) 10Dzahn: "IPv6 on the maintenance servers? These don't have the 'mapped' addresses yet. 
deployment servers do" [puppet] - 10https://gerrit.wikimedia.org/r/302649 (owner: 10Dzahn) [16:03:43] mutante: danke for the java8 thing :) [16:04:02] (03CR) 10Dzahn: "Thank you Bryan, also for adding legoktm" [puppet] - 10https://gerrit.wikimedia.org/r/298906 (owner: 10Dzahn) [16:04:14] (03CR) 10Chad: "Yeah, I realized that right after I added you as a reviewer :p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276518 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [16:05:07] (03CR) 10Jcrespo: [C: 032] Configuration changes regarding mysql exporter for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/306404 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [16:05:28] mutante: definitely no more AAAA that isn't mapped (uses dynamic macaddr-based addresses) [16:05:41] (03CR) 10Chad: "What's the status here?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281973 (https://phabricator.wikimedia.org/T27397) (owner: 10Matanya) [16:07:47] (03PS1) 10Brian Wolff: Fix API URL for Content security policy experiment on upload [puppet] - 10https://gerrit.wikimedia.org/r/306456 [16:07:57] (03CR) 10Chad: "Status?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289780 (https://phabricator.wikimedia.org/T119736) (owner: 10CSteipp) [16:08:54] 06Operations, 10ops-codfw, 06Discovery: rack/setup/deploy wdqs200[12] - https://phabricator.wikimedia.org/T142864#2579318 (10Papaul) [16:08:55] (03CR) 10Chad: "Fine by me, Darian do we want to roll this out to officewiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/290278 (https://phabricator.wikimedia.org/T135889) (owner: 10CSteipp) [16:09:35] 06Operations, 10ops-codfw, 06Discovery: rack/setup/deploy wdqs200[12] - https://phabricator.wikimedia.org/T142864#2549225 (10Papaul) a:05Papaul>03Gehel @Gehel installation complete [16:10:13] PROBLEM - Varnishkafka log producer on cp4005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [16:15:06] !log ema@palladium conftool action : set/pooled=no; selector: cp4005.ulsfo.wmnet (tags: ['dc=ulsfo', 'cluster=cache_upload', 'service=nginx']) [16:15:07] !log ema@palladium conftool action : set/pooled=no; selector: cp4005.ulsfo.wmnet (tags: ['dc=ulsfo', 'cluster=cache_upload', 'service=varnish-fe']) [16:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:03] hashar: de rien [16:16:33] bblack: yes, that was the point. some already have the mapped addresses but no AAAA record, some dont have mapped yet [16:18:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:18:33] RECOVERY - Varnishkafka log producer on cp4005 is OK: PROCS OK: 1 process with command name varnishkafka [16:21:22] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2579392 (10faidon) 4x4TB + HWRAID would be preferrable. In any case Dell should refund us the difference. 
[16:24:11] 06Operations, 10ops-codfw: codfw: rack/setup/deploy puppetmaster200[12]switch configuration - https://phabricator.wikimedia.org/T143800#2579430 (10Papaul) [16:26:06] 06Operations, 10ops-codfw, 10netops: codfw: rack/setup/deploy puppetmaster200[12]switch configuration - https://phabricator.wikimedia.org/T143800#2579430 (10Papaul) [16:28:17] 06Operations, 10ops-codfw: rack/setup/deploy puppetmaster200[12] - https://phabricator.wikimedia.org/T143255#2579451 (10mark) [16:28:19] 06Operations, 10ops-codfw, 10netops: codfw: rack/setup/deploy puppetmaster200[12]switch configuration - https://phabricator.wikimedia.org/T143800#2579449 (10mark) 05Open>03Resolved Done, put in private vlans. [16:34:29] 06Operations, 06Labs: grafana-labs.wikimedia.org doesn't reflect grafana-labs-admin.wikimedia.org - https://phabricator.wikimedia.org/T143556#2579482 (10fgiunchedi) some progress: the global user "Anonymous" lacked "Viewer" access to the default/main organization, now all dashboards are visible on https://graf... [16:35:02] (03PS1) 10Aaron Schulz: Fix MWMultiVersion IDEA warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306460 [16:43:06] 06Operations, 06Labs: 4.4-series kernel vs. iptables - https://phabricator.wikimedia.org/T142388#2579538 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [16:44:47] (03CR) 1020after4: WIP: Scap swat command (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [16:45:01] (03CR) 10Thcipriani: [C: 04-1] "Comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306429 (owner: 10Giuseppe Lavagetto) [16:45:53] (03PS1) 10Chad: Multiversion: delete deleteMediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306462 [16:46:54] twentyafterfour, thcipriani: Lol followup ^ [16:47:04] what? 
[16:47:27] ah [16:47:28] ok [16:47:31] delete deleteMediawiki [16:48:44] Platonides: I'm actually removing all of MediaWiki from the cluster today ;-) [16:48:45] jjk [16:48:48] (03PS1) 10EBernhardson: Cirrus: Send more like this queries to default cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306463 [16:49:05] Is my account OK now? [16:49:11] My account is OK? [16:49:29] (03CR) 10BryanDavis: scap::source: use puppet to manage directory creation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306429 (owner: 10Giuseppe Lavagetto) [16:49:29] * Platonides checks the date [16:49:34] 06Operations, 10ops-codfw: rack/setup/deploy puppetmaster200[12] - https://phabricator.wikimedia.org/T143255#2579564 (10Papaul) [16:49:58] ostriches: sudo ntptime pool.ntp.org [16:50:21] xiaocao2015_: which account? [16:50:26] you are not identified on irc [16:50:34] if that's the question [16:50:39] [[User:平天下的小曹2015]] [16:50:43] It has sha256 [16:50:45] !log bounce uwsgi on labmon1001 T143556 [16:50:46] T143556: grafana-labs.wikimedia.org doesn't reflect grafana-labs-admin.wikimedia.org - https://phabricator.wikimedia.org/T143556 [16:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:51:02] on which wiki? [16:51:04] A question:Who can use mediawiki api? [16:51:10] anyone [16:51:11] Anyone can use the api. [16:51:21] 06Operations, 10Wikimedia-Logstash, 03Discovery-Search-Sprint: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357#2579593 (10Gehel) Number of fields per index is high for indices created before August 1st: logstash-2016.07.25 : 106796 logstash-2016.07.... [16:51:41] yuvipanda: success! https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts [16:51:51] the account 平天下的小曹2015 is globally blocked… [16:52:31] I know [16:52:40] A question:Who can tell me how to use mediawiki api? [16:52:54] 平天下的小曹2015 is globally blocked because it was be hacked. 
[16:53:03] I tell to steward to block it. [16:53:11] A question:Who can tell me how to use mediawiki api? [16:53:17] https://www.mediawiki.org/wiki/API:Main_page [16:53:17] That will tell you how to use the API [16:53:32] xiaocao2015_: look at https://www.mediawiki.org/wiki/API:Main_page [16:53:33] I want to make a software to post a lot of [16:53:38] 06Operations, 10Wikimedia-Logstash, 03Discovery-Search-Sprint: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357#2579597 (10EBernhardson) Seems legitimate to me [16:53:40] (03PS1) 10Brian Wolff: Expand CSP report only test to frwiki. [puppet] - 10https://gerrit.wikimedia.org/r/306464 [16:53:54] I know.But it often tell me "Invaild token" [16:54:15] twentyafterfour: Sorta related....I would love to send the php/ and p/ symlinks to a special place in hell. [16:54:31] prolly impossible. [16:54:48] Use EXE or bat to send a lot of post [16:55:05] How to use bat porgram send a lot of post [16:55:49] xiaocao2015_: you realise that if you are blocked, you won't be able to write using the api, right? [16:56:24] NOT [16:56:32] I using api to write my wiki [16:56:33] {"error":{"code":"badtoken","info":"Invalid token","*":"See http://192.168.0.102:2333/api.php for API usage"}} [16:57:24] well, you token is invalid… [16:57:32] OK [16:57:46] How to send post requst in windows [16:58:19] (03CR) 10Brian Wolff: "Note, that the rather embarrassing typo that will be fixed in Ie65989e40, prevented the reports from being process properly, however, one " [puppet] - 10https://gerrit.wikimedia.org/r/306464 (owner: 10Brian Wolff) [16:58:37] pip install pywikibot [16:58:50] ? [16:58:57] Windows NOT LINUX [16:59:34] is this really on-topic for this channel at this point? 
[16:59:45] nope [16:59:50] * Platonides thinks it's better to leave this [16:59:55] (03PS1) 10Filippo Giunchedi: graphite: parametrize cors_origins for labs [puppet] - 10https://gerrit.wikimedia.org/r/306467 (https://phabricator.wikimedia.org/T143556) [16:59:57] ... [17:00:11] Platonides: Already did ;-) [17:00:27] sorry? [17:00:33] oh [17:00:34] nvm [17:00:35] lol [17:00:36] xD [17:00:40] ;-) [17:00:53] so you now realised we're in August, then :P [17:01:06] (03PS1) 10Papaul: DNS: Add mgmt and production DNS enntries for puppetmaster200[1-2] Bug:T143255 [dns] - 10https://gerrit.wikimedia.org/r/306469 (https://phabricator.wikimedia.org/T143255) [17:02:12] (03CR) 10Thcipriani: "lgtm, one minor question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306430 (owner: 10Giuseppe Lavagetto) [17:02:32] 06Operations, 10ops-codfw: rack/setup/deploy puppetmaster200[12] - https://phabricator.wikimedia.org/T143255#2579617 (10Papaul) [17:03:02] (03PS2) 10BBlack: Fix API URL for Content security policy experiment on upload [puppet] - 10https://gerrit.wikimedia.org/r/306456 (owner: 10Brian Wolff) [17:03:35] (03CR) 10BBlack: [C: 032 V: 032] Fix API URL for Content security policy experiment on upload [puppet] - 10https://gerrit.wikimedia.org/r/306456 (owner: 10Brian Wolff) [17:04:46] !log demon@tin Synchronized wmf-config/: Remove old obsolete ExtensionMessages files (duration: 00m 49s) [17:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:56] godog awesome! 
[17:12:04] (03CR) 10Yuvipanda: [C: 031] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/306467 (https://phabricator.wikimedia.org/T143556) (owner: 10Filippo Giunchedi) [17:12:04] bblack: thanks :) [17:12:09] godog +1'd [17:14:13] (03CR) 1020after4: [C: 031] Multiversion: delete deleteMediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306462 (owner: 10Chad) [17:17:39] (03PS2) 10Papaul: DNS: Add mgmt and production DNS entries for puppetmaster200[1-2] Bug:T143255 [dns] - 10https://gerrit.wikimedia.org/r/306469 (https://phabricator.wikimedia.org/T143255) [17:17:42] (03PS3) 10Dzahn: DNS: Add mgmt and production DNS entries for puppetmaster200[1-2] Bug:T143255 [dns] - 10https://gerrit.wikimedia.org/r/306469 (https://phabricator.wikimedia.org/T143255) (owner: 10Papaul) [17:19:52] (03PS1) 10Jcrespo: Fix mysqld exporter prometheus config not working in trusty [puppet] - 10https://gerrit.wikimedia.org/r/306470 [17:20:54] (03CR) 10jenkins-bot: [V: 04-1] Fix mysqld exporter prometheus config not working in trusty [puppet] - 10https://gerrit.wikimedia.org/r/306470 (owner: 10Jcrespo) [17:23:00] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [17:23:12] (03PS1) 10Chad: WIP: Remove "p" symlink from WMF config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306471 [17:24:04] (03PS2) 10Jcrespo: Fix mysqld exporter prometheus config not working in trusty [puppet] - 10https://gerrit.wikimedia.org/r/306470 (https://phabricator.wikimedia.org/T126757) [17:25:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [17:25:09] (03CR) 10jenkins-bot: [V: 04-1] Fix mysqld exporter prometheus config not working in trusty [puppet] - 10https://gerrit.wikimedia.org/r/306470 
(https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [17:26:28] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt and production DNS entries for puppetmaster200[1-2] Bug:T143255 [dns] - 10https://gerrit.wikimedia.org/r/306469 (https://phabricator.wikimedia.org/T143255) (owner: 10Papaul) [17:26:34] (03PS3) 10Jcrespo: Fix mysqld exporter prometheus config not working in trusty [puppet] - 10https://gerrit.wikimedia.org/r/306470 (https://phabricator.wikimedia.org/T126757) [17:27:26] !log deleting logstash indices from before august 1st (T142357) [17:27:27] T142357: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357 [17:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:59] (03PS4) 10Jcrespo: Fix mysqld exporter prometheus config not working in trusty [puppet] - 10https://gerrit.wikimedia.org/r/306470 (https://phabricator.wikimedia.org/T126757) [17:30:06] (03CR) 10Jcrespo: [C: 032] Fix mysqld exporter prometheus config not working in trusty [puppet] - 10https://gerrit.wikimedia.org/r/306470 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [17:35:15] (03CR) 10Dzahn: Adding Icinga checks for Maps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: 10Gehel) [17:38:18] yuvipanda: I have to run now, though https://puppet-compiler.wmflabs.org/3829/ looks good, could you merge/babysit it ? puppet on labmon1001 is disabled because I tried locally first [17:38:41] godog sure thing. [17:38:43] I'll do it later [17:39:21] yuvipanda: tyvm! 
worst case I'll merge it tomorrow UTC morning [17:39:56] (03CR) 10Alex Monk: [C: 031] "I independently made an almost identical commit: Id42124e3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [17:40:26] 06Operations, 10Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2579743 (10Dzahn) root@fermium:/var/lib/mailman/bin# ./remove_members -a maint-announce root@fermium:/usr/local/sbin# ./disable_list maint-announce maint-announce disabled. Archives should be availab... [17:40:27] (03Abandoned) 10Alex Monk: Remove old pre-Swift directory variables referencing upload7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298404 (https://phabricator.wikimedia.org/T64835) (owner: 10Alex Monk) [17:41:17] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2579748 (10Southparkfan) I wonder why @RobH and @Cmjohnson are talking about 4TB disks. The current problems are caused by 4k (6TB) disks, and the accessoires link given by Cmjohnson ment... [17:41:19] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2579749 (10Southparkfan) I wonder why @RobH and @Cmjohnson are talking about 4TB disks. The current problems are caused by 4k (6TB) disks, and the accessoires link given by Cmjohnson ment... [17:42:20] ?????????????????? 
[17:44:47] !log overwritting /etc/default/prometheus-mysqld-exporter on all trusty mysql nodes on codfw [17:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:52] 06Operations, 10Wikimedia-Mailing-lists: deactivate maint-announce - https://phabricator.wikimedia.org/T143760#2579760 (10Dzahn) 05Open>03Resolved removed the part of the exim alias that forwards it to lists, in private repo commit f4465ebd308e9538b3ab940 it used to forward to both RT and mailman, no... [17:47:06] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2579765 (10Dzahn) mailing list removed again by request. i don't know if moving this to phabricator is a thing that we want to happen in the future or never. [17:47:53] 06Operations, 10ops-eqiad: Rack/setup sodium (carbon/mirror server replacement) - https://phabricator.wikimedia.org/T139171#2579768 (10Cmjohnson) @southparkfan The issue is with the 4k and 512e disks. However, in order to replace them with something other than 4k and 512e disks we will need to reduce the capac... [17:48:12] 06Operations, 05WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#2579770 (10Dzahn) blocked by the procurement queue issues and maybe T118176 [17:55:19] (03CR) 10Brian Wolff: [C: 04-1] "Actually, lets maybe let the other change to the CSP header be on the site for a little bit before doing this." [puppet] - 10https://gerrit.wikimedia.org/r/306464 (owner: 10Brian Wolff) [18:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T1800). [18:00:04] Amir1, bawolff, and ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:20] hey! 
[18:01:34] I can SWAT today [18:02:42] awesome [18:02:53] Amir1: looking at how to sync your patch CommonSettings.php then InitialiseSettings.php should prevent any errors, correct? [18:03:08] (03PS2) 10Aaron Schulz: Fix MWMultiVersion IDEA warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306460 [18:03:27] thcipriani: the InitialiseSettings.php is just redundant fix [18:03:37] the order doesn't matter [18:03:52] okie doke [18:04:04] (03PS2) 10Thcipriani: Change ORES threshold to soft as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306435 (https://phabricator.wikimedia.org/T143738) (owner: 10Ladsgroup) [18:04:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306435 (https://phabricator.wikimedia.org/T143738) (owner: 10Ladsgroup) [18:05:03] (03Merged) 10jenkins-bot: Change ORES threshold to soft as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306435 (https://phabricator.wikimedia.org/T143738) (owner: 10Ladsgroup) [18:06:13] uhh, hmm. Logstash maintenance ongoing? [18:06:55] well, I guess I see a restart from 4 hours ago, but nothing recently in the SAL [18:08:03] thcipriani: afaik logstash maintenance is done. it wouldn't really effect swat anyways [18:08:09] (03PS3) 10Aaron Schulz: Fix MWMultiVersion IDEA warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306460 [18:08:30] (in terms of being able to check things are working, the maintenance shouldn't effect the ability of the cluster to accept and query logs) [18:08:46] ebernhardson: eh, I'm seeing the "Our servers are currently under maintenance or experiencing a technical problem." when I visit logstash :\ [18:08:53] hmm [18:10:25] kibana looks to be up and running afaik [18:11:03] (03CR) 10Dduvall: "> I suppose this can only be deployed during maintenance, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [18:11:22] from the backed or when you hit it in the browser? [18:11:29] *backend [18:11:36] thcipriani: the backends, the browser indeed reports (from varnish) a failure [18:11:51] ah, kk, wasn't sure if it was on my end somewhere [18:13:29] I can definitely still hit logstash from the deployment host so the logstash check for scap should still work at least. [18:13:31] oh, hmm i think i see a problem but its very odd...the /status endpoint which varnish checks is returning a 503 to wget (but a 200 to curl). something is odd but not sure what yet...investigating [18:14:31] thanks. I'm going to pause SWAT for the time being. Deploying without a good view on logs (even with fatalmonitor of fluorine) seems like a bad idea :) [18:15:03] okay, I'm around [18:17:06] (03CR) 10ArielGlenn: [C: 031] "After scrying puppet, mediawiki-config/wmf-config/filebackend*, *Settings.php and a few other MW files I think this is safe. But I'm not " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [18:17:09] thcipriani: try now [18:17:25] !log restarted kibana on logstash1001 [18:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:43] ebernhardson: back in business! Thank you! [18:17:47] !log reenabling cr2-codfw:xe-5/0/1 (link to cr2-eqiad), recovered since 17:02 UTC [18:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:13] kibana was stuck on 1001, but fine on 1002 and 1003. varnish really should have kept sending requests just not to 1001 though ... probably another reason to think about LVS for logstash [18:18:59] (03CR) 10BBlack: "Is frwiki a good choice here in terms of potential report volume? 
Something like svwiki or elwiki would still give significant real reque" [puppet] - 10https://gerrit.wikimedia.org/r/306464 (owner: 10Brian Wolff) [18:19:16] Amir1: https://gerrit.wikimedia.org/r/#/c/306435 is on mw1099, check please [18:19:49] okay on it [18:20:44] thcipriani: works just fine [18:21:11] Amir1: ack, going everywhere [18:21:26] (03CR) 10Dzahn: [C: 031] Restart exim daily on Monday to Friday [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:23:21] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:306435|Change ORES threshold to soft as default (T143738)]] (duration: 01m 01s) [18:23:23] T143738: Edits being flagged by review tool on enwiki aren't likely to be damaging - https://phabricator.wikimedia.org/T143738 [18:23:24] ^ Amir1 live everywhere [18:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:38] Okay, testing [18:24:12] Awesome [18:24:18] :) [18:24:19] worked [18:24:21] Thanks [18:24:49] bawolff: ping for SWAT [18:24:59] Hi! [18:25:09] I'm here [18:25:14] okie doke [18:26:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306454 (owner: 10Brian Wolff) [18:26:18] (03PS2) 10Thcipriani: Record content security policy events in log stash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306454 (owner: 10Brian Wolff) [18:26:27] (03CR) 10Thcipriani: Record content security policy events in log stash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306454 (owner: 10Brian Wolff) [18:26:37] (03CR) 10Thcipriani: [C: 032] "SWAT again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306454 (owner: 10Brian Wolff) [18:27:07] (03Merged) 10jenkins-bot: Record content security policy events in log stash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306454 (owner: 10Brian Wolff) [18:27:39] why does gerrit always do that to me? Won't let me rebase, (even when I know! 
it needs rebasing), CR+2, "Cannot merge". /me sighs [18:28:46] bawolff: patch is live on mw1099 if there's anything to check there [18:28:49] (03CR) 10Brian Wolff: "I admit, frwiki was chosen kind of randomly. Its expected that most reports would be generated when viewing an svg source file. frwiki has" [puppet] - 10https://gerrit.wikimedia.org/r/306464 (owner: 10Brian Wolff) [18:29:16] mw1099 = test.wikipedia.org when using the special http header, right? [18:29:33] you can use any wiki with the special header [18:29:37] yup [18:30:04] I could try and hand-craft an ajax request with that header to test it. Just a moment [18:30:16] thank you :) [18:32:47] ebernhardson: just saw your restart of kibana... seems that we are also missing an icinga check there... [18:33:41] hmm, header is X-Wikimedia-Debug: 1, right [18:34:12] to point at a specific backend you do: x-wikimedia-debug:backend=mw1099.eqiad.wmnet looks like [18:34:35] gehel: a check on /status couldn't hurt [18:35:05] or it could hit the api, /status is actually a static html page that then checks the api [18:35:08] bblack: hi! almost have a patch ready for the removal of geoiplookup. I hope I have a nice solution for eveyrone, including 3rd parties... Qukck question: if our sources can't locate a user, will the cookie be set to "", or maybe not set, or something else? Also, is this independent of whether the user already had a GeoIP cookie? [18:35:28] thcipriani: Thanks, that worked better [18:35:32] I can confirm it works [18:35:34] :) [18:35:49] awesome :) going live everywhere [18:37:29] AndyRussG: it will be set with the data ":::::v4" (all empty strings in the important fields). We forced the final legacy v4/v6 field to v4 to work around a transitional issue on our end. 
We're re-setting the old no-data IPv6 cookies ":::::v6" to new data on observation (should hit before CN sees anything) [18:37:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:306454|Record content security policy events in log stash]] (duration: 00m 52s) [18:37:39] ^ bawolff live everywhere [18:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:42] :) [18:37:48] of course, the could should handle the case that no cookie is set at all, due to a glitch or browser disable of cookies, etc [18:38:00] s/could/code/ :) [18:38:53] (03PS2) 10Thcipriani: Cirrus: Send more like this queries to default cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306463 (owner: 10EBernhardson) [18:39:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306463 (owner: 10EBernhardson) [18:39:26] (but in that case treating it the same as :::::v4 is appropriate, there still shouldn't be a fallback) [18:39:52] bblack: cool! Yeah, handling the case of no cookie [18:39:52] (03Merged) 10jenkins-bot: Cirrus: Send more like this queries to default cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306463 (owner: 10EBernhardson) [18:40:27] bblack: I think on the JS side we'll just consider cookie or no country field to mean, no GeoIP data. 
For our setup, there will be no background lookup [18:40:38] ebernhardson: patch is live on mw1099 [18:40:41] but I'm enabling a lightweight option for 3rd parties [18:42:37] thcipriani: looks good [18:42:47] ebernhardson: cool, going live everywhere [18:43:41] AndyRussG: sounds awesome, thanks [18:44:26] !log thcipriani@tin Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[gerrit:306463|Cirrus: Send more like this queries to default cluster]] (duration: 00m 46s) [18:44:30] ^ ebernhardson live everywhere [18:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:05] thcipriani: i see traffic shifting in my graphs, should be good. I'll keep an eye on it [18:45:18] ebernhardson: sounds good. Thank you! [18:46:30] bblack: likewise thx!! [18:47:32] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:01] PROBLEM - HHVM rendering on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:10] PROBLEM - HHVM rendering on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:21] PROBLEM - Apache HTTP on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:55:39] o/ [18:55:44] jouncebot: next [18:55:45] In 0 hour(s) and 4 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T1900) [18:56:14] thcipriani: is morning swat complete ? :) [18:56:58] hashar: yep. thanks for checking :) [18:58:01] going to promote group1 to .16 [18:58:13] boom goes the grrrit-wm [18:58:34] I thought we had that bot banned ? 
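Note on the GeoIP cookie exchange above (bblack / AndyRussG): a minimal Python sketch of the "no data" handling described there, assuming a six-field colon-separated cookie value. Only the empty-data value ":::::v4", the legacy ":::::v6", and the trailing v4/v6 field come from the discussion. The field names, the function name parse_geoip_cookie, and the sample value are hypothetical, and the real client-side code is JavaScript and is not shown in this log.

```python
from typing import Optional

# Assumed field order. The channel only confirms that the last field is the
# legacy v4/v6 marker and that a country field exists; the rest is a guess.
FIELDS = ("country", "region", "city", "lat", "lon", "af")


def parse_geoip_cookie(value: Optional[str]) -> Optional[dict]:
    """Return the cookie fields as a dict, or None when there is no GeoIP data.

    A missing cookie, a malformed cookie, and a cookie with an empty country
    field (for example ":::::v4") are all treated the same way: no data and
    no fallback lookup.
    """
    if not value:
        return None  # cookie absent (glitch, cookies disabled, ...)
    parts = value.split(":")
    if len(parts) != len(FIELDS):
        return None  # unexpected shape, treat as no data
    data = dict(zip(FIELDS, parts))
    if not data["country"]:
        return None  # ":::::v4" or legacy ":::::v6"
    return data


if __name__ == "__main__":
    print(parse_geoip_cookie(":::::v4"))
    print(parse_geoip_cookie("US:CA:San Francisco:37.77:-122.42:v4"))
```

Returning None for all three "no data" cases keeps the caller to a single check, which matches the point made in the channel that a missing cookie should be handled like ":::::v4".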
[18:58:42] somehow i managed to mess up my git/git review config and cant get it back to normal [18:59:00] even after remove with --purge and starting over [18:59:19] mutante: feel free to fill a task with output of: git remote -v ; git-review --verbose [18:59:31] andrew screwed his install yesterday as well [18:59:34] hashar: grrrit-wm banned? no, not at all [18:59:45] ok, looking at that [19:00:04] hashar: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T1900). Please do the needful. [19:00:28] (03CR) 10BBlack: [C: 031] "It's your call really. If you think the deluge from frwiki won't be awful I'm fine with it." [puppet] - 10https://gerrit.wikimedia.org/r/306464 (owner: 10Brian Wolff) [19:00:49] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.16 [19:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:43] those deployments are becoming annoying [19:02:51] we need a bot to [19:02:59] (03CR) 10Rush: [C: 032] deployment-prep: remove old deployment-fluorine hieradata [puppet] - 10https://gerrit.wikimedia.org/r/305765 (owner: 10Alex Monk) [19:03:03] (03PS2) 10Rush: deployment-prep: remove old deployment-fluorine hieradata [puppet] - 10https://gerrit.wikimedia.org/r/305765 (owner: 10Alex Monk) [19:03:05] !promote group1 1.28.0-wmf.16 [19:03:58] (03CR) 10Andrew Bogott: [C: 032] deployment-prep: Move udp2log to deployment-fluorine02 [puppet] - 10https://gerrit.wikimedia.org/r/305587 (owner: 10Alex Monk) [19:04:03] (03PS2) 10Andrew Bogott: deployment-prep: Move udp2log to deployment-fluorine02 [puppet] - 10https://gerrit.wikimedia.org/r/305587 (owner: 10Alex Monk) [19:04:41] (03CR) 10Andrew Bogott: [C: 032] mw-log-cleanup: remove wfDebug files in deployment-prep every week [puppet] - 10https://gerrit.wikimedia.org/r/305768 (owner: 10Alex Monk) [19:05:48] (03PS3) 10Andrew Bogott: deployment-prep: Move 
udp2log to deployment-fluorine02 [puppet] - 10https://gerrit.wikimedia.org/r/305587 (owner: 10Alex Monk) [19:06:04] (03PS1) 10Dzahn: installserver: make chromium use raid1-1part recipe [puppet] - 10https://gerrit.wikimedia.org/r/306492 [19:08:10] hashar: remote "gerrit" was using ssh:// remote "origin" was using https://, changing all remotes to ssh works. the part i dont get is _what_ changed that.. but anyways :) [19:08:34] mutante: no joke same thing happened to me [19:08:37] it was the same clone i used all the time [19:08:43] chasemp: oh.. interesting [19:08:58] I used git-review -r to set remote explicitly and it works [19:08:59] mutante: upgraded git-review maybe? [19:09:08] (03PS4) 10Andrew Bogott: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - 10https://gerrit.wikimedia.org/r/305767 (owner: 10Alex Monk) [19:09:30] I am using HEAD of git-review installed from git (that is version 1.25.1.dev11 ) [19:09:36] chasemp: ah, i had just tried the --setup / -s , then edited .git/config with vi [19:09:44] mutante: that solved it? [19:09:48] chasemp: yes [19:09:49] it has an option to just use origin with the pull URL being https and the push URL being set automagically to ssh:// [19:09:52] huh [19:10:05] im using plain git [19:10:18] git for window, so much easer to install then git-review [19:10:37] hashar: not aware of an upgrade, using the one from distro (stretch), 1.25.0-2 [19:10:53] since on windows i just save the hooks file in c:\programes files\git [19:10:55] but now i purged and reinstalled it too [19:10:59] so group1 to .16 seems to work fine. Only issue I noticed is catchable fatal error with ProofReadpage https://phabricator.wikimedia.org/T143817 [19:11:32] mutante: I'm glad you said something, I thought I was going crazy [19:11:43] / Argument 1 passed to ProofreadPage::onArticlePurge() must be an instance of Article, WikiPage given // [19:12:49] mutante maybe a bug in gerrit? 
[19:13:06] chasemp: heh:) i was also very confused. so what i did right before that happened was hitting ctrl +c during a git-review run, maybe that ... [19:14:25] but like 3 users around the same time..odd [19:14:48] i guess its a bug [19:14:54] either with git-review or gerrit [19:15:47] to clarify, running git-review --setup did _not_ fix it, i had just tried that multiple times, what did fix it was manually editing the config files in .git/config [19:16:01] be back soon [19:17:06] bblack: With the content-security-policy thing, currently I'm trying to figure out an issue, where it seems to append the middle of the JSON post body to the end of the url [19:17:14] Which is confusing me greatly [19:18:11] e.g. the api reports getting a parameter like: format=json%22,%22referrer%22:%22%22,%22script-sample%22:%22onmouseover%20attribute%20on%20g%20element%22,%22source-file%22:%22https://upload.wikimedia.org/wikipedia/test/d/de/FastilyTestCircle2.svg%3Fadffdasd%22,%22violated-directive%22:%22default-src%20%27none%27%22%7D%7D [19:18:34] weird [19:18:43] indeed [19:19:37] And it only happens when browser generates the CSP report. It does not occur when making what should be an equivalent request via ajax [19:22:06] I guess I could prevent the format parameter from being mangled by adding an & to the end of the url, but I'd rather know what is actually happening here [19:31:29] bblack: tangential question... Are we ever dealing with cases where people have a GeoIP cookie, but it's wrong? i.e., international travelers with long-running browser sessions or browsers that keep session cookies alive for a while [19:44:13] possibly I'm running into https://github.com/facebook/hhvm/issues/6676 [19:44:25] (03CR) 10Hashar: "The BUILD_TIMEOUT logic can be reused to bump other repositories timeout. 
The default is 30 minutes which is more than enough for most use" [debs/contenttranslation/giella-sme] - 10https://gerrit.wikimedia.org/r/294430 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [19:44:33] (03PS2) 10Dzahn: installserver: make chromium use raid1-1part recipe [puppet] - 10https://gerrit.wikimedia.org/r/306492 [19:44:39] (03PS3) 10Dzahn: installserver: make chromium use raid1-1part recipe [puppet] - 10https://gerrit.wikimedia.org/r/306492 [19:47:01] (03PS2) 10Dzahn: recdns: remove chromium from LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306311 [19:47:55] !log change-prop deploying 28a0057 [19:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:50:53] AndyRussG: no, we never have dealt with that case. it would be woefully inefficient to recheck on every request. We could look at something like switching from a session cookie to a persistent cookie with a reasonably-short lifetime, though (say 6h?). I don't know if the edge-case is worth worrying about too much, from a device that stayed asleep with browser open over a flight. If it caus [19:50:59] es them some pain, they can always close the browser? [19:51:39] ah, I figured out the issue I was encountering. HHVM, always populates $_POST from post body regradless of type, and the initial url parameter gets overriden by something that looks like a url parameter in the post body [19:52:00] we could cover some (mobile?) cases by hooking up to browser location detection too, but it tends to always do a popup question :( [19:59:06] bawolff: is what is "type" in that statement (method?)? IIRC post-params and query-params occupy the same virtual namespace anyways. 
[19:59:33] (03PS1) 10Dzahn: partman: delete the (8) unused recipes [puppet] - 10https://gerrit.wikimedia.org/r/306501 [19:59:37] bblack: I mean Content-Type header [19:59:59] The request has a body with a content-type: application/json [20:00:05] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T2000). [20:00:17] hhvm seems to consider it to be application/x-www-urlencoded [20:00:39] not for ores [20:00:48] tomorrow is very likely [20:01:07] which it breaks apart at & signs, which cause it to override the earlier parameter [20:01:27] Dereckson: deploying the ProofreadPage fix [20:01:42] bawolff: ah that's awful [20:02:15] !log hashar@tin Synchronized php-1.28.0-wmf.16/extensions/ProofreadPage/ProofreadPage.body.php: Fix hooks signatures T143817 (duration: 00m 47s) [20:02:16] T143817: Catchable fatal error: Argument 1 passed to ProofreadPage::onArticlePurge() must be an instance of Article, WikiPage given in /srv/mediawiki/php-1.28.0-wmf.16/extensions/ProofreadPage/ProofreadPage.body.php on line 444 - https://phabricator.wikimedia.org/T143817 [20:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:26] bawolff: seems worth trying that config flag [20:03:40] and yeah that's a crappy bug on the hhvm side [20:03:46] yeah, i was just trying to figure out what version of hhvm we are on [20:03:48] it seems weird that one reported bothered patching if the config flag is there [20:03:51] Dereckson: do you ahve access to logstash.wikimedia.org ? 
[20:03:54] !log T137474 Starting htmldumper in RESTBase Staging [20:03:55] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [20:03:58] config seems to be added in 3.10 [20:04:01] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2580357 (10Dereckson) Poulpy from fr.wikipedia suggests to use the abbreviation pgc.wikimedia.org. That would let the other concurrent n... [20:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:12] is it possible with just the config flag, we fail to decode urlparams if no content-type at all, which is some legacy UA case? [20:04:23] oh, maybe just a question of version then [20:04:39] hashar: yes I've [20:05:01] Dereckson: I have pushed the patch on all servers so it should be fixed. I have no clue how to trigger it on wikisource though [20:05:07] maybe I can proof read a page :] [20:05:16] (03PS1) 10Dzahn: netboot: fix typo, hostname "rheniumi" is rhenium [puppet] - 10https://gerrit.wikimedia.org/r/306515 [20:05:18] bblack: Possible, although they would have failed back when we used zend [20:05:27] hashar: it's on mw1099 or in prod? [20:05:27] as zend does not decode in that case by default [20:05:30] ok [20:05:38] ah in prod [20:05:39] okay [20:05:50] prod [20:05:58] (03PS2) 10Dzahn: netboot: fix typo, hostname "rheniumi" is rhenium [puppet] - 10https://gerrit.wikimedia.org/r/306515 [20:06:07] (03CR) 10Dzahn: [C: 032] netboot: fix typo, hostname "rheniumi" is rhenium [puppet] - 10https://gerrit.wikimedia.org/r/306515 (owner: 10Dzahn) [20:06:14] The hacky fix would just be to add an & sign at the end of the url [20:06:29] since then the reflected url would not cause problems [20:06:43] bawolff: are you reproducing in prod, beta, mw-vagrant, ... 
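(Annotation on the HHVM discussion above: the behavior described is that a JSON request body gets decoded as if it were urlencoded form data, is split at & signs, and a fragment that looks like format=... then overrides the real URL parameter. A rough Python sketch of that effect, not HHVM's actual code. The merge function, the example host, and the choice of "document-uri" as the field that reflects the URL are assumptions for illustration.)

```python
from urllib.parse import parse_qsl


def merged_params(query_string, raw_body):
    """Merge URL parameters with a body that is decoded as urlencoded
    form data regardless of its Content-Type."""
    params = dict(parse_qsl(query_string))
    # Splitting on '&' ignores JSON structure; later keys override earlier ones.
    params.update(parse_qsl(raw_body))
    return params


def csp_body(report_url):
    # Hypothetical report body that reflects the request URL in one field.
    return '{"csp-report":{"document-uri":"' + report_url + '","referrer":""}}'


query = "action=cspreport&format=json"
url = "https://wiki.example.org/w/api.php?" + query

# Without the workaround, "format" picks up the tail of the JSON body.
broken = merged_params(query, csp_body(url))

# With a trailing '&' on the URL, the reflected "format=json" ends cleanly.
worked_around = merged_params(query + "&", csp_body(url + "&"))

print(broken["format"])
print(worked_around["format"])
```

(The second call corresponds to the "add an & sign at the end of the url" hack mentioned in the chat: the JSON tail becomes its own fragment with no = sign and no longer touches format.)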
[20:06:55] bd808: prod and vagrant [20:07:06] Dereckson: I should contribute to wikisource a little more only have a few changes ( https://fr.wikisource.org/wiki/Sp%C3%A9cial:Contributions/Hashar ) [20:07:25] Not caught in my initial testing since originally I was using zend with not-vagrant [20:07:26] If we have the patch that includes the AlwaysDecodePostData flag it should be easy to check if that fixes the problem [20:07:52] I'm discussing with Yannf for a testing strategy. [20:08:10] The "We have a hack from years ago..." summary sounds like a FB hack that is leaking to the world [20:09:16] (03PS1) 10Dzahn: typos: add 'wqds' for 'wdqs' [puppet] - 10https://gerrit.wikimedia.org/r/306516 [20:10:34] (03PS2) 10Dzahn: typos/jenkins: add 'wqds' as detectable typo [puppet] - 10https://gerrit.wikimedia.org/r/306516 [20:11:28] hashar, are you done with the train? services are on [20:11:51] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [20:12:05] (03PS4) 10Krinkle: Fix MWMultiVersion IDEA warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306460 (owner: 10Aaron Schulz) [20:12:41] (03CR) 10Krinkle: [C: 031] Fix MWMultiVersion IDEA warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306460 (owner: 10Aaron Schulz) [20:13:35] (03PS3) 10Dzahn: typos/jenkins: add 'wqds' as detectable typo [puppet] - 10https://gerrit.wikimedia.org/r/306516 [20:13:53] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:14:31] (03PS4) 10Dzahn: installserver: make chromium use raid1-1part recipe [puppet] - 10https://gerrit.wikimedia.org/r/306492 [20:14:52] (03CR) 10Dzahn: [C: 032] installserver: make chromium use raid1-1part recipe [puppet] - 10https://gerrit.wikimedia.org/r/306492 (owner: 10Dzahn) [20:15:12] yurik: yeah [20:15:29] yurik: just had to cherry pick a patch Dereckson wrote but that is 
complete as well [20:16:10] so please proceed ! :] [20:17:42] hashar: i wonder what we can do to make the post-merge build not fail, 'operations-puppet-doc'. should i make a ticket maybe? [20:18:33] the thing is "Could not generate documentation: No such file or directory - dummy.rb" [20:19:38] hashar, ok, i will push some maps stuff [20:29:57] mutante: T143233 [20:29:57] T143233: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233 [20:30:34] yurik: this afternoon I found a task related to enhance easytimeline / visual etc. I guess most can be closed since Graph is around :] [20:30:50] hashar, agree [20:31:15] i'm sure timeline could in theory do something that graph cannot, but its a more limited solution in general [20:31:28] !log deployed tilerator https://gerrit.wikimedia.org/r/#/c/306514/ [20:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:02] 06Operations: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2580449 (10bd808) Maybe @hashar or @zeljkofilipin can take a look and figure out how to run under a different ruby or exclude the file that `puppet doc` hates from being processed? [20:32:08] bd808: thanks!:) [20:32:16] hashar, do you know why in gerrit i can hit "review +2" button for tilerator deploy repo, but not for the kartotherian one? [20:32:27] e.g. https://gerrit.wikimedia.org/r/#/c/306514/ vs [20:32:28] https://gerrit.wikimedia.org/r/#/c/306518/ [20:32:44] permissions are off ? [20:32:51] yurik: no jenkins validation [20:32:55] 06Operations: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2561502 (10Dzahn) Yep, i noticed this too in operations/puppet. The issue seems to be the "Could not generate documentation: No such file or directory - dummy.rb" part. 
[20:32:56] hashar, its not a perm, i can still go to "replication" [20:33:05] bd808, oh, that might be it [20:33:09] you need to add a noop test by jenkins [20:33:16] hashar, sorry, not replication, to "reply" [20:33:43] ohhh [20:33:46] missing a v+2 [20:33:47] * yurik looks lost [20:34:23] we should get tests for them probably :d [20:34:37] will add the noop [20:34:52] something like I did here -- https://gerrit.wikimedia.org/r/#/c/303218/ [20:36:20] https://gerrit.wikimedia.org/r/306537 [20:36:57] yurik: the maps/* repo do not have any tests that CI could run ? [20:37:25] https://gerrit.wikimedia.org/r/#/q/project:maps/kartotherian https://gerrit.wikimedia.org/r/#/q/project:maps/tilerator [20:37:26] eek [20:37:37] hashar, not at the moment - we have some production-level tests that scap runs [20:37:53] smoke tests yeah [20:37:59] and there is a large number of unit tests in various sub repos [20:38:12] where most of the logic is [20:38:33] but tilerator & kartotherian are not themselves much code, they are simply shells for other stuff [20:39:00] then you could get jshint / yaml linting etc to run at least [20:39:18] jouncebot: next [20:39:18] and obviously get the unit tests run eg the ones at https://phabricator.wikimedia.org/diffusion/GMKT/browse/master/test/ [20:39:18] In 0 hour(s) and 20 minute(s): Tool Labs admin console ("Striker") (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T2100) [20:39:46] !log deployed kartotherian https://gerrit.wikimedia.org/r/#/c/306518/ [20:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:07] hashar, i agree, need to get at least that to run [20:40:32] probably very little value still, but those same style settings should run on all of kartotherian repos [20:44:27] (03PS1) 10Papaul: Adding MAC entries and install params for puppetmaster200[1-2] Bug:T143255 [puppet] - 10https://gerrit.wikimedia.org/r/306546 (https://phabricator.wikimedia.org/T143255) 
[20:49:07] 06Operations, 07Puppet, 10Continuous-Integration-Config: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2580487 (10hashar) Thanks for the task. It is due to what Bryan said: [[https://tickets.puppetlabs.com/browse/PUP-3261|puppet doc passes fil... [20:52:38] (03PS2) 10Dzahn: Adding MAC entries and install params for puppetmaster200[1-2] Bug:T143255 [puppet] - 10https://gerrit.wikimedia.org/r/306546 (https://phabricator.wikimedia.org/T143255) (owner: 10Papaul) [20:53:24] !log gave bd808 password for striker db on terbium [20:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:53:46] (03PS2) 10Yuvipanda: Add toolsadmin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/305143 (https://phabricator.wikimedia.org/T136256) (owner: 10BryanDavis) [20:54:32] bd808 let's use this channel, so we can !log [20:54:38] let me know when you're ready for the logstash changes [20:54:39] (03CR) 10jenkins-bot: [V: 04-1] Add toolsadmin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/305143 (https://phabricator.wikimedia.org/T136256) (owner: 10BryanDavis) [20:54:43] (03CR) 10Dzahn: [C: 032] Adding MAC entries and install params for puppetmaster200[1-2] Bug:T143255 [puppet] - 10https://gerrit.wikimedia.org/r/306546 (https://phabricator.wikimedia.org/T143255) (owner: 10Papaul) [20:55:10] what the fuck, jenkins? https://gerrit.wikimedia.org/r/#/c/305143/2 [20:55:16] it's rebased to head... [20:55:27] yuvipanda: the db password works. I'll create the tables now [20:55:36] bd808 ok! [20:55:45] (03CR) 10Yuvipanda: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/305143 (https://phabricator.wikimedia.org/T136256) (owner: 10BryanDavis) [20:57:10] !log Created initial tables for striker in striker db on m5-master [20:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:53] aaah, Depends-On! 
[20:58:23] if you want to do the dns first just amend that out of the commit message [20:58:30] (03PS3) 10BryanDavis: Add toolsadmin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/305143 (https://phabricator.wikimedia.org/T136256) [20:58:36] (03PS3) 10Dzahn: recdns: remove chromium from LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306311 [20:59:02] !log T137474: Upgrading xenon.eqiad.wmnet to cassandra_2.2.6-wmf2 [20:59:03] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [20:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:39] yuvipanda: ready for logstash patches whenever you are [21:00:04] bd808 and yuvipanda: Dear anthropoid, the time has come. Please deploy Tool Labs admin console ("Striker") (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T2100). [21:00:16] way ahead of you jouncebot [21:02:05] bd808 :D ok, ready now! [21:02:36] bd808 do you want me to merge them together or with a gap? [21:02:42] (03PS3) 10Yuvipanda: logstash: Tag Striker messages for indexing [puppet] - 10https://gerrit.wikimedia.org/r/306055 (https://phabricator.wikimedia.org/T143172) (owner: 10BryanDavis) [21:02:59] how the fuck do I know what patch depensd on what?! [21:03:01] fuck you, gerrit [21:03:08] * yuvipanda rages for a little bit [21:03:53] its the "related changes" way over on the right [21:04:21] and I've not figured out if it is constantly ordered or not honestly [21:04:43] bd808 but that has *everything*. right now there's a change for MAC address for random nodes there [21:04:48] (03CR) 10Krinkle: "If we go ahead with this, we should also add built-in protection against production traffic. E.g. 
maybe ignore the function call on reques" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277585 (owner: 10Ori.livneh) [21:04:53] (03CR) 10Yuvipanda: [C: 032 V: 032] logstash: Tag Striker messages for indexing [puppet] - 10https://gerrit.wikimedia.org/r/306055 (https://phabricator.wikimedia.org/T143172) (owner: 10BryanDavis) [21:05:17] (03PS2) 10Yuvipanda: Revert "logstash: new input for msgpack over UDP" [puppet] - 10https://gerrit.wikimedia.org/r/306299 (owner: 10BryanDavis) [21:05:27] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "logstash: new input for msgpack over UDP" [puppet] - 10https://gerrit.wikimedia.org/r/306299 (owner: 10BryanDavis) [21:06:00] (03PS4) 10Yuvipanda: logstash: Tag Striker messages for indexing [puppet] - 10https://gerrit.wikimedia.org/r/306055 (https://phabricator.wikimedia.org/T143172) (owner: 10BryanDavis) [21:06:06] (03CR) 10Yuvipanda: [V: 032] logstash: Tag Striker messages for indexing [puppet] - 10https://gerrit.wikimedia.org/r/306055 (https://phabricator.wikimedia.org/T143172) (owner: 10BryanDavis) [21:06:16] yuvipanda: together is fine. one logstash restart instead of 2 [21:06:23] (03CR) 10Dzahn: [C: 032] recdns: remove chromium from LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306311 (owner: 10Dzahn) [21:06:29] (03PS4) 10Dzahn: recdns: remove chromium from LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306311 [21:06:29] bd808 done [21:07:09] !log Forcing puppet run on logstash1001 [21:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:49] bd808 I'm gonna do the DNS change in the meantime. 
I prefer doing it early so we don't have to wait for TTLs before calling it done, but remember to *not* curl it until we are fully set up to prevent caching 404s :) [21:08:10] that's the trick for sure [21:08:25] (03CR) 10Yuvipanda: [C: 032] Add toolsadmin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/305143 (https://phabricator.wikimedia.org/T136256) (owner: 10BryanDavis) [21:08:26] !log Ran DELETE FROM wbs_propertypairs WHERE pid1 = '641' on Wikidata for T132839 [21:08:27] T132839: Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [21:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:20] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures [21:12:09] !log Forcing puppet run on logstash1002 [21:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:20] !log Forced puppet run on logstash1003 but cron had beaten me to it [21:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:29] ACKNOWLEDGEMENT - puppet last run on xenon is CRITICAL: CRITICAL: Puppet has 1 failures eevans Testing a patched package/build (T137474) - The acknowledgement expires at: 2016-08-27 21:14:41. [21:16:10] yuvipanda: ok. logstash resatarted clean everywhere and we are getting normal event volume in kibana [21:16:25] \o/ cool [21:17:12] time for the actual fun stuff? [21:17:21] yeah! [21:17:24] scap3 patch now, right? [21:17:34] should we monitor ORES clusters as well? [21:18:19] it's just two nodes, so might as well. I can take care of that. 
[21:18:22] bd808 ok time to merge [21:18:35] (03PS31) 10Yuvipanda: Provision Striker via scap3 [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) (owner: 10BryanDavis) [21:18:40] (03CR) 10Yuvipanda: [C: 032 V: 032] Provision Striker via scap3 [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) (owner: 10BryanDavis) [21:19:56] !log forcing puppet run on scb1001 [21:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:20:15] !log forcing puppet run on tin [21:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:23] bd808 halt! Aug 24 21:20:18 scb1001 uwsgi-ores[26509]: unable to find logger file [21:22:34] ores on scb1001 is dead, I disabled puppet on 1002 [21:22:41] frack [21:23:39] what does /etc/uwsgi/apps-enabled/ores.ini look like? [21:23:54] -logto=/srv/log/ores/main.log [21:23:55] +logger=file:/srv/log/ores/main.log [21:23:56] was the relevant diff [21:24:00] I've manually diffed it back [21:24:09] !log disable puppet on scb1001 and 1002 [21:24:14] let me do it on 200* as well [21:24:16] to prevent alerts [21:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:29] was there a "plugins=python, python3, logfile" change too? [21:24:34] nope [21:24:55] that would be the problem then... [21:24:58] plugins=python3,stats_pusher_statsd [21:25:14] !log disable puppet on scb2001 and 2002 as well [21:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:24] do they override plugins in their config instead of using need-plugins? [21:25:30] (03PS1) 10Brian Wolff: Set hhvm.virtual_host[default][always_decode_post_data] = false [puppet] - 10https://gerrit.wikimedia.org/r/306548 [21:25:50] * bd808 looks for ores uwsgi in ops/puppet [21:26:07] for --^. 
How does one go about testing puppet changes locally to make sure they do what i think they do? [21:26:56] bd808 yup [21:26:57] (03CR) 10jenkins-bot: [V: 04-1] Set hhvm.virtual_host[default][always_decode_post_data] = false [puppet] - 10https://gerrit.wikimedia.org/r/306548 (owner: 10Brian Wolff) [21:27:09] yuvipanda: yeah, that;s the problem. let me fix. [21:27:14] well, that's a good metric i did something stupid [21:27:17] !log depooling chromium for reinstall. scheduled downtime for host and service IPs [21:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:26] !log dzahn@palladium conftool action : set/pooled=no; selector: name=chromium.wikimedia.org [21:27:32] bawolff there's the puppet compiler, which tells you what changes. however that still doesn't tlel you how the software will react [21:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:06] bawolff https://wikitech.wikimedia.org/wiki/Puppet_Testing is the closest we have to 'documentation' I think [21:28:18] (03PS2) 10Jforrester: On public wikis, show "Publish" rather than "Save" on edit pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306303 (https://phabricator.wikimedia.org/T131132) [21:28:21] bd808 thanks! [21:29:48] (03PS1) 10BryanDavis: ores: Use need-plugins to load stats_pusher_statsd [puppet] - 10https://gerrit.wikimedia.org/r/306549 [21:29:53] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:53] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:33] uh [21:30:39] that should fix it yuvipanda [21:30:39] urandom gwicke ^ ? restbase? [21:31:04] yuvipanda: i'll have a look [21:31:10] thanks urandom [21:31:10] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:11] angry about ores maybe? 
[21:31:15] bd808 should python3be in there? [21:31:23] bd808 shouldn't be, since ORES isn't down [21:31:30] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:30] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:30] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:32] puppet is disabled in the scb* nodes, that is all [21:31:51] yuvipanda: it's fine for python3 to be in both. I've tested that on striker [21:32:01] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [21:32:11] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [21:32:15] bd808 ok cool. [21:32:25] (03CR) 10Yuvipanda: [C: 032] ores: Use need-plugins to load stats_pusher_statsd [puppet] - 10https://gerrit.wikimedia.org/r/306549 (owner: 10BryanDavis) [21:32:25] probably not necessary for either project, but better explicit than implicit :) [21:32:35] (03PS2) 10Brian Wolff: Set hhvm.virtual_host[default][always_decode_post_data] = false [puppet] - 10https://gerrit.wikimedia.org/r/306548 [21:33:06] (03PS1) 10EBernhardson: Disable phrase suggester for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306552 (https://phabricator.wikimedia.org/T143260) [21:33:11] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [21:33:13] bd808 yeah, I'm going to merge, run/verify on scb2001 and see how that goes [21:33:21] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:33:30] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [21:33:30] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [21:33:31] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [21:33:34] well, *this* one isn't me? [21:34:08] mobrovac do you know if mobileapps is still continuing to flap? [21:34:14] (03CR) 10jenkins-bot: [V: 04-1] Set hhvm.virtual_host[default][always_decode_post_data] = false [puppet] - 10https://gerrit.wikimedia.org/r/306548 (owner: 10Brian Wolff) [21:34:31] that was really weird [21:35:30] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:35:48] (03PS3) 10Brian Wolff: Set hhvm.virtual_host[default][always_decode_post_data] = false [puppet] - 10https://gerrit.wikimedia.org/r/306548 [21:37:10] bd808 ok, that seems to make ores happy [21:37:20] awesome [21:37:31] bd808 am going to verify this on other scb nodes. do you want to go ahead with scap stuff in th emeantime? [21:37:41] oh, no [21:37:45] I need to do stuff [21:37:55] yuvipanda: already {{done}} until we get to californium [21:38:01] bd808 re: striker::uwsgi::secret_config - I wonder if it should be using secret() instead [21:38:07] instead of hiera that is [21:38:22] it's all setup to use hiera [21:38:35] let me take a look [21:38:40] and that's much easier for Labs testing [21:38:41] I think the private repo doesn't really have any secrets in hiera [21:38:42] * yuvipanda verifies [21:38:50] that's true [21:39:07] oh no, we have stuff in hiera for secrets [21:39:08] nvm [21:39:21] :) I thought so [21:39:22] bd808 yeah, let's just use hiera for rn [21:39:38] bd808 I'll do it after doing the remaining 2 scb nodes [21:39:51] yup. 
sounds good [21:40:46] !log re-enabled puppet, ran puppet and verified ORES is ok on scb2001 / 2002 [21:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:22] !log re-enabled puppet, ran puppet and verified ORES is ok on scb1001 / 1002 [21:45:25] bd808 ok, so that's sorted \o/ [21:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:54] (03PS1) 10Dzahn: recdns: remove chromium from LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306555 [21:45:57] bd808 does secret key need to be just hex or something like that? [21:46:37] (03PS1) 10Krinkle: errorpage: Improve text flow and layout [puppet] - 10https://gerrit.wikimedia.org/r/306556 [21:46:53] I put one you can use in the file on terbium [21:46:57] it's just a string [21:47:02] (03CR) 10Dzahn: [C: 032] recdns: remove chromium from LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306555 (owner: 10Dzahn) [21:48:44] ah right [21:48:46] ok [21:50:44] !log running puppet on lvs servers, removing chromium from resolv.conf for reinstall [21:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:53:03] bd808 time to merge https://gerrit.wikimedia.org/r/#/c/305141/ [21:53:12] (03PS2) 10Yuvipanda: Provision Tool Labs admin console (Striker) on Californium [puppet] - 10https://gerrit.wikimedia.org/r/305141 (https://phabricator.wikimedia.org/T136256) (owner: 10BryanDavis) [21:53:21] (03CR) 10Yuvipanda: [C: 032 V: 032] Provision Tool Labs admin console (Striker) on Californium [puppet] - 10https://gerrit.wikimedia.org/r/305141 (https://phabricator.wikimedia.org/T136256) (owner: 10BryanDavis) [21:53:23] * bd808 squee [21:54:19] !log forcing puppet run on californium [21:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:54:41] bd808 ooooooo [21:54:43] Error: Could not retrieve catalog from remote server: Error 400 on 
SERVER: OS Debian >= jessie required. at /etc/puppet/modules/striker/manifests/uwsgi.pp:32 on node californium.wikimedia.org [21:54:47] californium is trusty [21:54:50] (03CR) 10Krinkle: [C: 032] Fix MWMultiVersion IDEA warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306460 (owner: 10Aaron Schulz) [21:54:58] arrgh [21:55:09] that would have been good to know I guess [21:55:16] (03Merged) 10jenkins-bot: Fix MWMultiVersion IDEA warnings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306460 (owner: 10Aaron Schulz) [21:55:30] yeah :| [21:55:51] everything should work on trusty (my dev box is trusty) [21:55:57] but I'll need to rebuild the wheels [21:56:13] :( sorry! we should've flagged it for you [21:56:24] I can do that pretty quickly I think [21:56:27] (03PS1) 10Dzahn: recdns: re-add chromium to LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306559 [21:56:31] bd808 I'll make the puppet patch now in the meantime [21:57:37] !log krinkle@tin Synchronized multiversion/MWMultiVersion.php: Ie9c568a87ae - No-op clean-up (duration: 00m 49s) [21:57:43] AaronSchulz: ^ [21:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:58:51] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: puppet fail [21:58:58] yuvipanda: fix modules/striker/manifests/build.pp too plz in your patch [21:59:17] yeah ok [21:59:49] Krinkle: tx [22:00:41] (03CR) 10Brian Wolff: [C: 04-1] "I don't think this is actually right, since this doesn't enclose in square brackets. I'm not sure how to actually make it have the hhvm.vi" [puppet] - 10https://gerrit.wikimedia.org/r/306548 (owner: 10Brian Wolff) [22:02:31] is gerrit dead? [22:02:32] ooooh, no [22:02:33] my network [22:04:52] (03PS1) 10Yuvipanda: striker: Require trusty, not jessie [puppet] - 10https://gerrit.wikimedia.org/r/306563 [22:05:08] bd808 ^ +1? 
[22:05:38] !log stopping puppet and pdns-recursor on chromium [22:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:06:11] [22:06:31] then reinstalls the second DNS recursor [22:07:01] (03PS2) 10Yuvipanda: striker: Require trusty, not jessie [puppet] - 10https://gerrit.wikimedia.org/r/306563 [22:07:18] yuvipanda: ^ found a couple of packages that changed [22:07:39] not sure about the version numbers in uwsgi.pp either [22:08:44] I'll check that in a minute; building wheels right now [22:08:53] ok! [22:11:55] (03CR) 10Yuvipanda: [C: 032] striker: Require trusty, not jessie [puppet] - 10https://gerrit.wikimedia.org/r/306563 (owner: 10Yuvipanda) [22:12:20] 06Operations, 10ops-codfw: rack/setup/deploy puppetmaster200[12] - https://phabricator.wikimedia.org/T143255#2580737 (10Papaul) [22:14:04] (03PS2) 10Brian Wolff: Expand CSP report only test to elwiki. [puppet] - 10https://gerrit.wikimedia.org/r/306464 [22:14:25] (03CR) 10Brian Wolff: "Switched to elwiki." [puppet] - 10https://gerrit.wikimedia.org/r/306464 (owner: 10Brian Wolff) [22:14:27] (03CR) 10jenkins-bot: [V: 04-1] Expand CSP report only test to elwiki. [puppet] - 10https://gerrit.wikimedia.org/r/306464 (owner: 10Brian Wolff) [22:14:57] (03PS3) 10Brian Wolff: Expand CSP report only test to elwiki. [puppet] - 10https://gerrit.wikimedia.org/r/306464 [22:15:32] !log installing puppetmaster200[1-2] [22:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:04] (03PS2) 10BBlack: errorpage: Improve text flow and layout [puppet] - 10https://gerrit.wikimedia.org/r/306556 (owner: 10Krinkle) [22:16:13] (03CR) 10BBlack: [C: 032 V: 032] errorpage: Improve text flow and layout [puppet] - 10https://gerrit.wikimedia.org/r/306556 (owner: 10Krinkle) [22:17:04] bd808 fails right now with https://phabricator.wikimedia.org/P3887 btw [22:17:47] hmmm... ok [22:18:10] the clone is failing for ... unspecified reasons? 
[22:18:51] * thcipriani deployment lurks [22:19:09] did you run: scap deploy --init and check out submodules on the deployment host? [22:19:22] return code '70' ? [22:19:40] I did --init but I didn't submodule init [22:19:42] I'll fix in a minute [22:20:47] !log rebooting chromium into PXE [22:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:58] (03PS5) 10Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - 10https://gerrit.wikimedia.org/r/305767 [22:23:39] ok... brand new untested wheels ready to test live in prod (as one does) [22:24:18] bah another small problem... [22:24:26] wrong hostname for californium [22:24:29] * bd808 fixes [22:25:01] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv4: Active [22:25:28] (03PS3) 10Alex Monk: mw-log-cleanup: remove wfDebug files in deployment-prep every week [puppet] - 10https://gerrit.wikimedia.org/r/305768 [22:27:39] !log Deployed striker (2fdf103) to californium [22:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:28:10] yuvipanda: scap3 failed with 22:26:39 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'striker/deploy', '-g', 'default', 'promote'] on californium.wikimedia.org returned [70]: 22:26:16 INFO - Starting new HTTP connection (1): deployment.eqiad.wmnet [22:28:20] that looks like maybe firewall? not sure actually [22:28:25] thcipriani: ^ ideas [22:29:07] I'd try deploying from tin, watch the deployment-log [22:29:13] scap deploy-log -v [22:29:23] hmm [22:29:25] what does return code 70 mean? [22:29:26] didn't we check this maybe? [22:29:39] ah ha! 
"CalledProcessError: Command 'sudo /usr/sbin/service uwsgi-striker restart' returned non-zero exit status 1" [22:29:48] (03PS1) 10Dzahn: installserver: fix MAC for puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/306573 [22:29:53] jsut wacky reporting without the really verbose loe [22:29:55] (log [22:29:57] right, because it isn't already there I guess [22:30:01] PROBLEM - Host 2620:0:861:2:208:80:154:157 is DOWN: PING CRITICAL - Packet loss = 100% [22:30:19] maybe? try another puppet run I guess? [22:30:21] return code 70 is an unhandled error [22:30:25] I fixed the submodule stuff [22:30:57] bd808 yup, started one [22:31:00] ACKNOWLEDGEMENT - Host 2620:0:861:2:208:80:154:157 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reinstall [22:31:42] (03CR) 10Dzahn: [C: 032] "as pointed out by papaul, missing one character" [puppet] - 10https://gerrit.wikimedia.org/r/306573 (owner: 10Dzahn) [22:31:47] bd808 hmm, no change for iptables [22:32:09] yeah, my firewall thing was a red herring [22:32:28] either the sudoers rules aren't working or uwsgi isn't restarting [22:32:31] bd808 yeah. [22:32:36] (03CR) 10Alex Monk: "Note: Now that cleanups are running smoothly on deployment-fluorine02 this might not be strictly necessary" [puppet] - 10https://gerrit.wikimedia.org/r/305768 (owner: 10Alex Monk) [22:32:46] yuvipanda@californium:~$ sudo service uwsgi-striker status [22:32:47] uwsgi-striker: unrecognized service [22:33:10] I don't see a uwsgi-striker upstart file [22:33:18] I wonder if this is systemd only? 
Or that's systemd nameing [22:33:39] * bd808 tested it all on jessie :/ [22:33:53] I know that uwsgi::app isn't systemd only [22:34:06] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 8, down: 0, shutdown: 2 [22:35:15] bd808 yup, second puppet run put the uwsgi config in there [22:35:26] bd808 and uwsgi-striker is up and running [22:35:47] bd808 nginx isn't tho [22:35:48] 2016/08/24 22:34:50 [emerg] 14325#0: bind() to 0.0.0.0:80 failed (98: Address already in use) [22:35:58] I guess apache or whatever horizon is running is already using port 80 [22:36:38] bblack: Thanks. How long for a deploy roughly? Or does it require a manual restart? [22:36:40] if horizon is using apache we can switch to that [22:36:53] (not urgent in the least ofc, just curious) [22:37:05] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:37:18] yuvipanda: yeah. it's running apache [22:37:26] so ... [22:37:47] bd808 we could do that, or we could have nginx listen on a different port? 
[22:38:09] I suppose we could and also adjust the varnish forward [22:38:15] (03CR) 10jenkins-bot: [V: 04-1] Discovery stats module [puppet] - 10https://gerrit.wikimedia.org/r/305673 (https://phabricator.wikimedia.org/T143048) (owner: 10MaxSem) [22:38:17] oh wow, it's running mod_uwsg [22:38:17] i [22:38:26] * yuvipanda doens't touch that rn [22:38:28] I'm not sure if that would make anything funky or not [22:38:37] bd808 yeah, pretty sure that won't make anything funky [22:38:40] well [22:38:54] things that assume a particular port might, but they shouldn't care about nginx' port anyway [22:39:09] as long as django doesn't do something dumb when writing urls [22:39:54] yeah [22:40:18] (03PS5) 10MaxSem: Discovery stats module [puppet] - 10https://gerrit.wikimedia.org/r/305673 (https://phabricator.wikimedia.org/T143048) [22:40:23] the nginx port can jsut be changed with hiera [22:40:30] easy enough to test out [22:40:54] got a random open port there to place it on? [22:41:12] bd808 8044? [22:41:47] (03CR) 10Dzahn: "duplicate of https://gerrit.wikimedia.org/r/#/c/287145/" [dns] - 10https://gerrit.wikimedia.org/r/287593 (https://phabricator.wikimedia.org/T134360) (owner: 10Jcrespo) [22:42:10] bblack: thx for the info and apologies for the delay in replying... Yeah I have heard of a few cases... there are some folks who rarely restart their computers and/or often restore previous browser sessions... Also IIRC there was a bug on Safari whereby session cookies stuck around after the browser was restarted, but I can't find the link to that now. Anyway, yeah, fully agreed that it's not an [22:42:11] urgent concern at all, just thought I'd mention the circumstances in which some different approach might be worthwhile (short cookie lifetime sounds good) :) thx again! 
[22:42:49] (03PS1) 10BryanDavis: striker: move nginx to another port [puppet] - 10https://gerrit.wikimedia.org/r/306574 [22:43:08] (03Abandoned) 10Dzahn: Remove DNS entries of db1058 [dns] - 10https://gerrit.wikimedia.org/r/287145 (https://phabricator.wikimedia.org/T134360) (owner: 10Southparkfan) [22:43:18] (03CR) 10Yuvipanda: [C: 032 V: 032] striker: move nginx to another port [puppet] - 10https://gerrit.wikimedia.org/r/306574 (owner: 10BryanDavis) [22:43:35] (03CR) 10Dzahn: [C: 04-2] "already done by Chris in commit 2016979ded611256e5f4b321" [dns] - 10https://gerrit.wikimedia.org/r/287593 (https://phabricator.wikimedia.org/T134360) (owner: 10Jcrespo) [22:43:42] (03PS2) 10Dzahn: Remove db1058 entries [dns] - 10https://gerrit.wikimedia.org/r/287593 (https://phabricator.wikimedia.org/T134360) (owner: 10Jcrespo) [22:43:47] !log force a puppet run on californium [22:43:54] (03Abandoned) 10Dzahn: Remove db1058 entries [dns] - 10https://gerrit.wikimedia.org/r/287593 (https://phabricator.wikimedia.org/T134360) (owner: 10Jcrespo) [22:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:44:15] yuvipanda: hmmm how do we put a weird port into modules/role/manifests/cache/misc.pp ? [22:44:47] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2263194 (10Dzahn) I have abandoned 2 pending changes in DNS repo for this, that were already duplicate by Chris' change. Just cleaning up. 
[22:44:48] I think we have to add a new backend maybe [22:44:49] bd808 there's an example in there I remember seeing [22:45:09] i think you can use any port, but you cant have 2 different ones on the same backend [22:45:14] it goes in be_opts, but not sure if that can be per host [22:45:18] yeah [22:45:20] ok [22:45:26] so a new backend [22:45:30] easy enough [22:45:58] 06Operations, 10ops-eqiad: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2580842 (10Dzahn) [22:47:09] bd808 btw, I can hit the service and get a response! [22:47:11] it's a 500 tho [22:47:26] django.db.utils.Error: {'desc': "Can't contact LDAP server", 'info': 'A TLS packet with unexpected length was received.'} [22:47:42] hmmm [22:47:58] do we not have tls for internal ldap? [22:49:28] !log chromium - revoking and re-signing puppet certs, salt keys, initial puppet run.. [22:49:28] bd808 if I hand hack it to be 'ldap' instead of 'ldaps' it works [22:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:49:46] ah. ok [22:49:54] 06Operations, 06WMF-Legal, 06WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#2580844 (10ZhouZ) So to recap what we want to do: The process would be for any staff member requesting NDA access for their NDA account to create a phabricator task using their accou... [22:49:55] bd808 so I guess we don't have TLS? 
[22:50:10] that's a simple hiera change [22:50:16] bd808 /etc/ldap.conf doesn't have ldaps [22:50:20] no, ye have ldaps in other places [22:50:24] *nod* [22:50:44] like for example when icing atalks to ldap-labs, it's ldaps:// [22:50:46] (03PS2) 10BryanDavis: Add toolsadmin.wikimedia.org to misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/305142 (https://phabricator.wikimedia.org/T136256) [22:50:56] (03CR) 10Alex Monk: [C: 032] [Beta Cluster] Remove PoolCounter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298919 (https://phabricator.wikimedia.org/T38891) (owner: 10Greg Grossmeier) [22:52:01] mutante bd808 so if I specify ldaps and then do not specify port, it works [22:52:02] (03PS3) 10BryanDavis: Add toolsadmin.wikimedia.org to misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/305142 (https://phabricator.wikimedia.org/T136256) [22:52:03] well, at least it says so in the AuthLDAPURL string [22:52:04] (03CR) 10Alex Monk: [C: 032] deployment-prep: Move poolcounter to deployment-poolcounter02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305837 (owner: 10Alex Monk) [22:52:40] yuvipanda: oh, maybe it flubbed the port number for ldaps [22:52:42] maybe it was ldap but on the ldaps port [22:52:44] s/it/I/ [22:52:51] so I suspect that the issue is that LDAPS is on a different port number [22:53:13] yeah. 
if it works without the port number that's great [22:53:18] (03PS2) 10Alex Monk: deployment-prep: Move poolcounter to deployment-poolcounter02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305837 [22:53:47] 3269 [22:53:49] (03CR) 10Alex Monk: deployment-prep: Move poolcounter to deployment-poolcounter02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305837 (owner: 10Alex Monk) [22:53:58] (03CR) 10Alex Monk: [C: 032] deployment-prep: Move poolcounter to deployment-poolcounter02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305837 (owner: 10Alex Monk) [22:54:07] bd808 we also need to open a firewall hole for 8044 [22:54:12] let me take care of that [22:54:24] (03Merged) 10jenkins-bot: deployment-prep: Move poolcounter to deployment-poolcounter02 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305837 (owner: 10Alex Monk) [22:54:26] (03PS4) 10Alex Monk: [Beta Cluster] Remove PoolCounter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298919 (https://phabricator.wikimedia.org/T38891) (owner: 10Greg Grossmeier) [22:54:45] (03CR) 10Alex Monk: [Beta Cluster] Remove PoolCounter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298919 (https://phabricator.wikimedia.org/T38891) (owner: 10Greg Grossmeier) [22:54:58] (03CR) 10BBlack: [C: 04-1] "Multiple ports on the same host *probably* doesn't work yet. 
We've hacked around similar issues in the past by adding a new aliased hostn" [puppet] - 10https://gerrit.wikimedia.org/r/305142 (https://phabricator.wikimedia.org/T136256) (owner: 10BryanDavis) [22:55:27] 389 - LDAP 636 - LDAPS [22:55:48] 3268/3269 Microsoft LDAP/AD [22:55:57] (03CR) 10Alex Monk: [C: 032] [Beta Cluster] Remove PoolCounter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298919 (https://phabricator.wikimedia.org/T38891) (owner: 10Greg Grossmeier) [22:56:24] (03PS1) 10BryanDavis: striker: Fix ldaps port number [puppet] - 10https://gerrit.wikimedia.org/r/306576 [22:56:27] (03Merged) 10jenkins-bot: [Beta Cluster] Remove PoolCounter override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298919 (https://phabricator.wikimedia.org/T38891) (owner: 10Greg Grossmeier) [22:56:39] bblack: thanks for that tip. [22:56:57] (03CR) 10Dzahn: [C: 031] "ack, 636 = LDAPS" [puppet] - 10https://gerrit.wikimedia.org/r/306576 (owner: 10BryanDavis) [22:58:06] !log krenair@tin Synchronized wmf-config/LabsServices.php: labs-only change: https://gerrit.wikimedia.org/r/305837 (duration: 00m 48s) [22:58:11] (03CR) 10Dzahn: [C: 032] striker: Fix ldaps port number [puppet] - 10https://gerrit.wikimedia.org/r/306576 (owner: 10BryanDavis) [22:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:58:21] thanks mutante [22:58:44] bd808: yuvipanda: do we need to defer SWAT a little bit (there are two changes, one for mediawiki-config, one for the core), or you're doing puppet changes only? [22:58:52] bd808: yeah it's not a true varnish problem or anything. it's just a quirk of our insane VCL backend templating systems. they traditionally end up using the first label of the hostname as an identifier that clashes. 
We've extended that to the whole FQDN now, but not the port number :) [22:59:10] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: labs-only change: https://gerrit.wikimedia.org/r/298919 (duration: 00m 46s) [22:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:59:20] it seems like such a simple problem, but it runs deep [22:59:31] Dereckson I think you can go ahead, I'm doing puppet changes only. [22:59:37] ok [22:59:39] Dereckson: you're fine. we are not messing with MW stuff [22:59:58] bblack: :) the story of VCL [23:00:05] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160824T2300). [23:00:05] James_F and Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:11] * James_F waves. [23:00:26] Hi James_F. [23:00:29] I can SWAT this evening. [23:00:32] Kk. [23:00:37] I keep reading (Max 8 patches) as 'MaxSem 8 patches' [23:00:52] nooooooooooö [23:01:09] xD [23:01:33] (03PS3) 10Dereckson: On public wikis, show "Publish" rather than "Save" on edit pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306303 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [23:01:44] hi. i just added some last-minute patches for the SWAT. [23:01:58] PROBLEM - Host 2620:0:861:2:7a2b:cbff:fe08:aa48 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:2:7a2b:cbff:fe08:aa48 [23:02:07] Dereckson: Note that my config change is in advance of the train (intentionally). [23:02:18] Dereckson: So there shouldn't be any change at all after it's merged. :-) [23:02:26] they're debugging for unreproducible errors, so i won't be able to tell if they work (other than 'yes, this does not break everything') [23:02:27] (Except on Beta Cluster.) 
[23:02:39] ACKNOWLEDGEMENT - Host 2620:0:861:2:7a2b:cbff:fe08:aa48 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:2:7a2b:cbff:fe08:aa48 daniel_zahn + up ip addr add 2620:0:861:2:208:80:154:157/64 dev eth0 [23:03:56] James_F: https://phabricator.wikimedia.org/T131132 is pretty controversial [23:04:13] there are still fresh opposition (7 days ago) [23:04:16] bd808 do you have a wip patch for the vcl stuff? or should I take a look / attempt to move it to apache? [23:04:17] (03PS1) 10BryanDavis: Add service name californium8044 for californium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/306582 (https://phabricator.wikimedia.org/T138546) [23:04:26] ah :) [23:04:28] Dereckson: It's been announced multiple times. There will always be opposition. [23:04:47] Dereckson: There's also been lots of support. Unanimity is not the criterion. [23:05:30] (03PS4) 10BryanDavis: Add toolsadmin.wikimedia.org to misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/305142 (https://phabricator.wikimedia.org/T136256) [23:05:36] that's an old one too, https://usability.wikimedia.org/wiki/Usability_study_features [23:06:02] Dereckson: This is not the venue to discuss product concerns. If you're not comfortable merging it, I can get someone else to do it instead, no worries. :-) [23:06:09] yuvipanda: so we can either do that service name or move it to apache [23:06:16] which seems less gross? [23:06:47] -1 per risker [23:07:18] I'm not doing anything too fancy in the nginx config, so porting to apache might be more sane [23:07:40] jsut so have one less random weird thing (or 3) about the service [23:08:11] bd808 yeah, I agree. [23:08:22] Dereckson is right, there is strong opposition [23:08:23] James_F: and announced in tech news in April, looks good [23:08:28] but then [23:08:35] mutante: Not really. 
[23:08:37] bd808 in the super long run the right thing to do would be to have horizon alos use uwsgi and then just use nginx [23:08:40] users seems to have read the tech news and visited the task [23:08:48] *also [23:08:50] mutante: And again, this is not remotely the right venue to discuss this. [23:08:56] yuvipanda: well... [23:09:39] yuvipanda: should we call a halt for today, decide on which way to flop things, and do it tomorrow? [23:09:59] bd808 that works for me too, although I wouldn't mind continuing if you want to. [23:10:23] moving horizon to uwsgi is probably pretty easy; not sure what andrewbogott thinks about that [23:10:39] Dereckson: you can recuse yourself if you want to, someone else may need to deploy it later then [23:10:55] yuvipanda: I've got time, but I'm not sure we have consensus on the "right" plan [23:11:14] Yeah, I'm understanding if you feel uncomfortable. Please don't feel pressured into doing something you're not happy with. [23:11:16] bd808 I think the extra CNAME is too hacky, so I'd prefer to not do that. [23:11:17] (03CR) 10Dereckson: [C: 04-1] "I'd recommend to defer this change. It's an old request, one I like personally, and this has been announced in tech news in April with fai" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306303 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [23:11:27] yuvipanda: agreed. it looks gross [23:11:44] so either both to apache or both to nginx [23:11:44] bd808 so then it's just apache vs nginx [23:11:52] jinx! [23:11:55] Dereckson: ... but that's not quite the same thing as pushing against it. 
:-( [23:12:35] (03Abandoned) 10BryanDavis: Add service name californium8044 for californium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/306582 (https://phabricator.wikimedia.org/T138546) (owner: 10BryanDavis) [23:12:56] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 07Technical-Debt, 07Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2580877 (10AlexMonk-WMF) [23:13:11] James_F: I *personnaly* like the "publish" wording, but I noticed on the task some arguments were valuable and not really addressed (I sympathize with the trouble for that kind of changes, there are good venues for bikeshedding) [23:13:40] Dereckson: if you just wanted to pass, no need to -1 it in gerrit [23:13:43] Dereckson: It's pretty rude in your comment to assume that those aren't already taken into account. :-( [23:14:14] (03CR) 10Jforrester: "> I'd recommend to defer this change. It's an old request, one I like personally, and this has been announced in tech news in April with f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306303 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [23:16:50] there was somethign else from MatmaRex ? [23:16:51] greg-g: the CR -1 is to reflect the opposition on the task [23:19:16] greg-g: hm? [23:19:44] MatmaRex: you had patches for SWAT, I was just trying to get us moving along again :) [23:20:21] oh, indeed. 
[23:20:23] ah thanks greg-g, didn't notice the new additions [23:20:34] (03PS1) 10MaxSem: Disable on most Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306587 [23:21:01] MatmaRex: jouncebot doesn't announce last minute patches, don't hesitate to notify #wikimedia-operations when you do a last minute addition [23:21:04] (03CR) 10jenkins-bot: [V: 04-1] Disable on most Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306587 (owner: 10MaxSem) [23:22:08] i mentioned it (but didn't ping anyone, i guess) [23:22:30] (03PS2) 10MaxSem: Disable on most Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306587 [23:24:49] Fix ProofreadPage::updatePrIndex signature live on mw1099, I've ask a wikisource sysop (yannf) to test if all works as expected now. [23:25:16] we're waiting qunit tests for the UW changes [23:28:23] 306571 looks good [23:29:02] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2580897 (10Dzahn) [23:29:28] (03PS13) 10Yuvipanda: Introduce 'clush' module and toollabs role [puppet] - 10https://gerrit.wikimedia.org/r/305804 [23:30:00] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10Dzahn) root@chromium:~# grep active /proc/mdstat md1 : active (auto-read-only) raid1 sda2[0] sdb2[1] md0 : active raid1 sda1[0] sdb1[1] [23:30:18] MatmaRex: live on mw1099 [23:31:01] (and I confirm you mentioned it while I was looking to Gerrit) [23:32:12] Dereckson: thanks. 
UW seems to be fine with it [23:32:26] ok [23:32:52] scapping 306571, then the UW changes [23:33:11] !log dereckson@tin Synchronized php-1.28.0-wmf.16/extensions/ProofreadPage/ProofreadPage.body.php: Fix ProofreadPage::updatePrIndex signature (T143817) (duration: 00m 50s) [23:33:12] T143817: Catchable fatal error: Argument 1 passed to ProofreadPage::onArticlePurge() must be an instance of Article, WikiPage given in /srv/mediawiki/php-1.28.0-wmf.16/extensions/ProofreadPage/ProofreadPage.body.php on line 444 - https://phabricator.wikimedia.org/T143817 [23:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:39] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 07Technical-Debt, 07Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#2580903 (10Krenair) [23:35:13] !log dereckson@tin Synchronized php-1.28.0-wmf.16/extensions/UploadWizard/resources/: More debug logging for Firefox's 'NS_ERROR_NOT_AVAILABLE' exceptions (T136831) (duration: 00m 47s) [23:35:14] T136831: NS_ERROR_NOT_AVAILABLE - https://phabricator.wikimedia.org/T136831 [23:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:35:22] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [23:36:30] !log dereckson@tin Synchronized php-1.28.0-wmf.15/extensions/UploadWizard/resources: More debug logging for Firefox's 'NS_ERROR_NOT_AVAILABLE' exceptions (T136831) (duration: 00m 49s) [23:36:31] T136831: NS_ERROR_NOT_AVAILABLE - https://phabricator.wikimedia.org/T136831 [23:36:35] MatmaRex: in prod [23:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:25] Dereckson: thank you [23:37:58] You're welcome [23:39:02] !log chromium - install ntpdate, stop ntp, sync time with hydrogen, start ntp, remove ntpdate [23:39:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:08] !log dzahn@palladium conftool action : set/pooled=yes; selector: name=chromium.wikimedia.org [23:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:13] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset -0.00151 secs [23:46:05] yay [23:50:19] 06Operations: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#2580958 (10Dzahn) 21:27 mutante: depooling chromium for reinstall. scheduled downtime for host and service IPs 21:50 mutante: running puppet on lvs servers, removing chromium from resolv.conf for reinstall 22:05 mutante... [23:51:22] (03PS2) 10Dzahn: recdns: re-add chromium to LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306559 [23:52:31] (03PS4) 10Dzahn: recdns: re-add chromium to LVS nameservers_override [puppet] - 10https://gerrit.wikimedia.org/r/306559 [23:53:20] (03PS1) 10Yuvipanda: tools: Don't have static inherit toollabs baggage [puppet] - 10https://gerrit.wikimedia.org/r/306588 [23:55:41] (03CR) 10Dzahn: [C: 032] "putting it back in production" [puppet] - 10https://gerrit.wikimedia.org/r/306559 (owner: 10Dzahn) [23:55:59] (03PS2) 10Yuvipanda: tools: Don't have static inherit toollabs baggage [puppet] - 10https://gerrit.wikimedia.org/r/306588 [23:56:07] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't have static inherit toollabs baggage [puppet] - 10https://gerrit.wikimedia.org/r/306588 (owner: 10Yuvipanda) [23:56:40] (03CR) 10Liuxinyu970226: [C: 031] On public wikis, show "Publish" rather than "Save" on edit pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306303 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [23:58:52] (03PS5) 10BryanDavis: Add toolsadmin.wikimedia.org to misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/305142 (https://phabricator.wikimedia.org/T136256) [23:59:27] (03CR) 10BryanDavis: "PS5 is a revert to PS1. 
We are going to fix the port problem on californium." [puppet] - https://gerrit.wikimedia.org/r/305142 (https://phabricator.wikimedia.org/T136256) (owner: BryanDavis)