[00:20:25] (03PS1) 10Tim Landscheidt: sudo: Use validate_cmd for validating sudoers files [puppet] - 10https://gerrit.wikimedia.org/r/326376 [00:38:41] (03PS2) 10Tim Landscheidt: Set SYS_UID_MAX and SYS_GID_MAX to 499 [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) [00:39:38] (03CR) 10jenkins-bot: [V: 04-1] Set SYS_UID_MAX and SYS_GID_MAX to 499 [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) (owner: 10Tim Landscheidt) [00:39:52] (03CR) 10Tim Landscheidt: [C: 04-1] "(I tested the file_line logic to work properly.)" [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) (owner: 10Tim Landscheidt) [00:42:17] (03PS3) 10Tim Landscheidt: Set SYS_UID_MAX and SYS_GID_MAX to 499 [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) [01:03:33] (03PS1) 10Tim Landscheidt: Tools: Remove bashisms from clush [puppet] - 10https://gerrit.wikimedia.org/r/326379 [01:03:35] (03PS1) 10Tim Landscheidt: Tools: Quote arguments in clush [puppet] - 10https://gerrit.wikimedia.org/r/326380 [01:05:39] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 3 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [01:32:39] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [01:36:14] (03PS4) 10BryanDavis: l10nupdate: acquire scap lock before changing files [puppet] - 10https://gerrit.wikimedia.org/r/303923 (https://phabricator.wikimedia.org/T72752) [01:43:49] 06Operations, 06Labs, 13Patch-For-Review: audit labs versus production ssh keys - https://phabricator.wikimedia.org/T108078#2864021 (10RobH) a:05RobH>03None [02:01:39] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. 
Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [02:29:39] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [03:35:30] (03Abandoned) 10Tim Landscheidt: sudo: Use validate_cmd for validating sudoers files [puppet] - 10https://gerrit.wikimedia.org/r/326376 (owner: 10Tim Landscheidt) [03:38:49] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:47:49] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:47:59] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:48:39] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [03:48:49] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [03:55:39] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:04:35] 07Puppet, 06Labs: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#2864105 (10scfc) [04:06:49] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [04:08:59] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=5509.80 Read Requests/Sec=2487.00 Write Requests/Sec=0.80 KBytes Read/Sec=37831.60 KBytes_Written/Sec=161.60 [04:15:59] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=71.30 Read Requests/Sec=0.50 Write Requests/Sec=3.00 KBytes Read/Sec=8.00 KBytes_Written/Sec=66.80 [04:23:39] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [05:10:19] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:31:49] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave] [05:31:49] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave] [05:39:19] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [05:48:49] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [05:48:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:57:19] PROBLEM - puppet last run on 
labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:25:19] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:35:40] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ethtool] [06:42:29] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [06:43:29] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] [06:45:19] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[zsh-beta] [06:52:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This needs to be done carefully and surely depend on the context: we don't want the jobrunners/videoscalers (to the very least) to time ou" [puppet] - 10https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: 10Mark Bergsma) [06:55:12] 06Operations: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748#2864154 (10elukey) ``` elukey@oxygen:~$ sudo invoke-rc.d kafkatee reload elukey@oxygen:~$ echo $? 102 ``` [06:55:19] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [07:02:39] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:09:00] 06Operations, 10MediaWiki-API, 10Parsoid, 10RESTBase, and 6 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2864158 (10Joe) >>! In T97192#2860191, @Anomie wrote: > Tried my code from T97192#1237258 with `hhvm.server.... [07:12:19] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:12:36] 06Operations: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748#2864159 (10elukey) I agree with Riccardo, we'd probably need to merge the two rotate scripts. The `kafkatee` is shipped with a logrotate script, so we'd need to remove it from there and only let puppet do the... 
[07:12:55] (03PS1) 10Marostegui: db-codfw.php: Depool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326388 (https://phabricator.wikimedia.org/T151552) [07:13:56] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326388 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [07:14:31] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326388 (https://phabricator.wikimedia.org/T151552) (owner: 10Marostegui) [07:17:39] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2064 - T151552 (duration: 02m 21s) [07:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:53] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [07:22:19] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:30:07] !log Stop replication db2064 for maintenance - T151552 [07:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:20] T151552: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552 [07:33:29] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0] [07:39:29] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] [07:57:29] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:25:29] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:26:29] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:45:08] (03CR) 10Alexandros Kosiaris: [C: 032] Add otrs-wiki.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/326291 (https://phabricator.wikimedia.org/T152870) (owner: 10Reedy) [08:49:54] 06Operations, 10ops-eqiad, 06Services (watching): scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2864231 (10akosiaris) 05Open>03Resolved Both servers have not spewed any warning during the last 2 days, I am happily gonna resolve this. Thanks @Cmjohnson ! [08:53:21] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [08:53:22] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mobileapps']) [08:53:26] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [08:53:29] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=citoid']) [08:53:29] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [08:53:30] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [08:53:32] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=graphoid']) [08:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:37] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 
'cluster=scb', 'service=mathoid']) [08:53:40] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=mathoid']) [08:53:41] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [08:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:45] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=cxserver']) [08:53:48] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [08:53:51] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [08:53:54] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams']) [08:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:55] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=eventstreams']) [08:54:00] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender']) [08:54:02] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=pdfrender']) [08:54:05] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [08:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:08] !log akosiaris@puppetmaster1001 conftool action : set/weight=15; 
selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=ores']) [08:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:10] !log increase by 50% the weight of scb1003, scb1004 for most services on them now that they no longer exhibit temperature problems. These boxes are more powerful than scb1001, scb1002 and should be able to serve more requests. T150882 [08:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:45] T150882: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882 [09:01:27] <_joe_> wat? 
[09:02:30] (03PS1) 10Alexandros Kosiaris: Revert "Add otrs-wiki.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/326397 [09:02:32] (03PS1) 10Alexandros Kosiaris: Actually add otrs-wiki.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/326398 (https://bugzilla.wikimedia.org/152870) [09:04:24] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Add otrs-wiki.m.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/326397 (owner: 10Alexandros Kosiaris) [09:04:32] (03CR) 10Alexandros Kosiaris: [C: 032] Actually add otrs-wiki.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/326398 (https://bugzilla.wikimedia.org/152870) (owner: 10Alexandros Kosiaris) [09:05:51] !log Deploy alter table db1049 (master) dewiki.revision - T148967 [09:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:02] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [09:10:14] ah there you go [09:10:24] something was missing in my Monday [09:11:34] _joe_: yeah... turns out a lot of our boxes have thermal issues [09:12:03] up to now they are usually easily fixed by some thermal paste [09:13:35] these 2 (scb1003, scb1004) were throttling themselves down, failing to serve requests at some point. Usually escaped icinga cause they would cool down and be able once more to serve enough requests [09:14:52] 06Operations, 07Puppet, 06Labs, 10Phabricator, and 2 others: Install Arcanist in toollabs::dev_environ - https://phabricator.wikimedia.org/T139738#2864252 (10hashar) Puppet patch https://gerrit.wikimedia.org/r/#/c/297975/ will add arcanist on tools development environment for Trusty and Jessie. Precise is... [09:16:11] (03CR) 10Hashar: [C: 031] "T94792 is about removing Precise support in tools-labs. People are already instructed to migrate to Trusty/Jessie. 
If one really wanted " [puppet] - 10https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: 10Dereckson) [09:17:27] 06Operations: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748#2864257 (10elukey) Myself from the past created T145490 [09:20:33] akosiaris: hello. Don't you get an Icinga probe to monitor thermal status nowadays ? [09:21:09] I clearly remember brandon/chris talked about ages ago [09:21:49] ah that is https://phabricator.wikimedia.org/T125205 , proposing to add a probe based on ipmi [09:25:19] hashar: yup, there are some concerns over in that task IIRC, still under consideration though [09:25:39] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:25:53] akosiaris: yeah I can imagine. I was merely thinking out loud trying to figure out whether I have dreamed about that task or if it was really a thing :D [09:26:15] :-) [09:26:19] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:26:22] the reality and dream realm somehow overlaps in my brain :/ [09:27:52] (03CR) 10DCausse: [C: 031] Use logstash's prune filter for api-feature-usage-sanitized [puppet] - 10https://gerrit.wikimedia.org/r/313035 (owner: 10Anomie) [09:29:49] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:29:49] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:29:50] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:32:40] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [09:32:49] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [09:32:59] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [09:33:59] 07Puppet, 10Beta-Cluster-Infrastructure: puppet failure on deployment-phab01 ... is not a Hash. It looks to be a Array at /etc/puppet/modules/phabricator/manifests/init.pp:68 - https://phabricator.wikimedia.org/T147818#2864262 (10hashar) [09:35:26] (03PS1) 10Hashar: phabricator: fix passing config on labs [puppet] - 10https://gerrit.wikimedia.org/r/326401 (https://phabricator.wikimedia.org/T147818) [09:38:03] (03CR) 10Hashar: "That should fix the puppet failure on deployment-phab01 / deployment-phab02 , though I haven't cherry picked it on the puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/326401 (https://phabricator.wikimedia.org/T147818) (owner: 10Hashar) [09:38:15] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: puppet failure on deployment-phab01 ... is not a Hash. It looks to be a Array at /etc/puppet/modules/phabricator/manifests/init.pp:68 - https://phabricator.wikimedia.org/T147818#2864279 (10hashar) a:03hashar [09:39:49] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:40:19] (03PS6) 10Marostegui: check_mariadb.pl: Fix small display issue [puppet] - 10https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) [09:41:39] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! 
[09:41:51] (03CR) 10Marostegui: [C: 032] check_mariadb.pl: Fix small display issue [puppet] - 10https://gerrit.wikimedia.org/r/326124 (https://phabricator.wikimedia.org/T152766) (owner: 10Marostegui) [09:42:39] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [09:43:39] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [09:44:39] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [09:51:29] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [09:52:29] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] [09:53:39] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:55:19] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:56:29] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:56:33] (03CR) 10Ema: [C: 032] dstat_varnishstat: remove varnish 3 compatibility code [puppet] - 10https://gerrit.wikimedia.org/r/326247 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [09:56:41] (03PS2) 10Ema: dstat_varnishstat: remove varnish 3 compatibility code [puppet] - 10https://gerrit.wikimedia.org/r/326247 (https://phabricator.wikimedia.org/T150660) [09:56:48] (03CR) 10Ema: [V: 032 C: 032] dstat_varnishstat: remove varnish 3 compatibility code [puppet] - 10https://gerrit.wikimedia.org/r/326247 (https://phabricator.wikimedia.org/T150660) (owner: 10Ema) [10:03:25] (03PS2) 10Gehel: node service - allow empty entry point [puppet] - 10https://gerrit.wikimedia.org/r/324190 (https://phabricator.wikimedia.org/T150021) [10:04:08] jouncebot: next [10:04:08] In 2 hour(s) and 55 minute(s): Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T1300) [10:06:33] (03CR) 10Gehel: [C: 032] node service - allow empty entry point [puppet] - 10https://gerrit.wikimedia.org/r/324190 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [10:07:49] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:14:39] (03CR) 10Gehel: tilerator: deploy config with scap3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [10:15:22] (03PS2) 10Gehel: tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) [10:16:10] (03CR) 10jenkins-bot: [V: 04-1] tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [10:18:04] (03PS3) 10Gehel: tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) [10:18:11] (03PS2) 10Jcrespo: 
[WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 [10:19:09] (03CR) 10jenkins-bot: [V: 04-1] tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) (owner: 10Gehel) [10:19:22] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [10:20:13] (03PS4) 10Gehel: tilerator: deploy config with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/324761 (https://phabricator.wikimedia.org/T150021) [10:23:25] 06Operations, 10DNS, 10Traffic, 07Mobile: Many misc wikis lack mobile domains - https://phabricator.wikimedia.org/T152882#2864342 (10akosiaris) [10:23:28] 06Operations, 10DNS, 10Traffic, 07Mobile, 13Patch-For-Review: OTRS-Wiki link to mobile website - https://phabricator.wikimedia.org/T152870#2864339 (10akosiaris) 05Open>03Resolved a:03akosiaris Change merged, reverted, correct one merged. Links now work, resolving [10:26:29] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:32:28] (03PS3) 10Jcrespo: [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 [10:33:27] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [10:38:49] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:43:56] (03PS4) 10Jcrespo: [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 [10:44:54] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [10:52:48] (03PS5) 10Jcrespo: [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 [10:53:35] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [10:54:22] (03CR) 10Paladox: "We should just move those hosts to the main phabricator class as it now works on labs." [puppet] - 10https://gerrit.wikimedia.org/r/326401 (https://phabricator.wikimedia.org/T147818) (owner: 10Hashar) [10:54:29] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [10:55:26] (03PS6) 10Jcrespo: [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 [10:56:30] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Create framework to transfer files over the LAN [puppet] - 10https://gerrit.wikimedia.org/r/326155 (owner: 10Jcrespo) [10:57:29] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] [11:03:29] 06Operations, 06Operations-Software-Development: E901 SyntaxError: invalid syntax is wrongly raised on using python's abc by jenkins python CI linter - https://phabricator.wikimedia.org/T152950#2864426 (10jcrespo) [11:04:29] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: puppet failure on deployment-phab01 ... is not a Hash. It looks to be a Array at /etc/puppet/modules/phabricator/manifests/init.pp:68 - https://phabricator.wikimedia.org/T147818#2704482 (10Paladox) Should probably switch these hosts to use the mai... 
[11:06:49] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [11:06:59] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:07:49] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:21:29] PROBLEM - High load average on labstore1003 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [11:25:29] RECOVERY - High load average on labstore1003 is OK: OK: Less than 50.00% above the threshold [16.0] [11:34:49] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [11:35:59] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [11:57:39] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:17] (03CR) 10Odder: [C: 04-1] "I'm boldly -1ing this as I really think this requires a wider discussion among the community." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [12:22:35] (03PS1) 10Jdrewniak: Bumping Portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326408 (https://phabricator.wikimedia.org/T128546) [12:25:39] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:26:37] (03PS2) 10Odder: Add localized logo for Gujarati Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323236 (https://phabricator.wikimedia.org/T121853) [12:28:20] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:33:15] (03PS1) 10Odder: Set $wgCategoryCollation for Finnish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326409 (https://phabricator.wikimedia.org/T151570) [12:34:59] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [12:35:49] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:37:29] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:40:31] (03CR) 10Dereckson: [C: 031] Set $wgCategoryCollation for Finnish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326409 (https://phabricator.wikimedia.org/T151570) (owner: 10Odder) [12:41:13] (03CR) 10Dereckson: [C: 031] Alias from WP to NS_PROJECT in kuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326246 (https://phabricator.wikimedia.org/T152815) (owner: 10Urbanecm) [12:47:28] !log Created Translate tables on no.wikimedia (T152490) [12:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:40] T152490: Enable the Translate extension on Wikimedia Norge's wiki - https://phabricator.wikimedia.org/T152490 [12:48:14] jouncebot: refresh [12:48:16] I refreshed my knowledge about deployments. [12:51:05] (03CR) 10Dereckson: [C: 031] "This change is ready for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323834 (https://phabricator.wikimedia.org/T149002) (owner: 10Niharika29) [12:53:19] (03PS1) 10ArielGlenn: some pylint of script that produces list of lst good dumps [puppet] - 10https://gerrit.wikimedia.org/r/326413 (https://phabricator.wikimedia.org/T152954) [12:53:40] * bawolff sad about the black and white logo [12:53:45] * bawolff likes colours [12:54:03] what black and white thing? 
[12:54:44] https://gerrit.wikimedia.org/r/307475 [12:55:04] The WMF logo becoming plain black [12:55:57] huh someone -1'd it there at the end [12:56:20] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:56:30] (03PS1) 10ArielGlenn: more pylint of script that produces list of last good dumps [puppet] - 10https://gerrit.wikimedia.org/r/326415 (https://phabricator.wikimedia.org/T152954) [12:57:21] realistically, odder's -1 is probably not going to stop it [12:57:47] I'm more sad about the decision, not the fact that people are implementing the decision [12:58:26] (03PS1) 10ArielGlenn: still more pylint of script that produces list of last good dumps [puppet] - 10https://gerrit.wikimedia.org/r/326416 (https://phabricator.wikimedia.org/T152954) [12:59:14] did that have to do with accessibility? I sort of remember an internal discussion about the colors [13:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T1300). Please do the needful. 
[13:00:19] * aude waves [13:02:09] (03PS1) 10ArielGlenn: list last n good dumps: add (unimplemented) option for dumping for rsyncers [puppet] - 10https://gerrit.wikimedia.org/r/326417 (https://phabricator.wikimedia.org/T152954) [13:02:52] (03PS1) 10Giuseppe Lavagetto: puppetmaster: add script to generate and sign ECDSA certificates [puppet] - 10https://gerrit.wikimedia.org/r/326418 [13:04:10] (03CR) 10jenkins-bot: [V: 04-1] list last n good dumps: add (unimplemented) option for dumping for rsyncers [puppet] - 10https://gerrit.wikimedia.org/r/326417 (https://phabricator.wikimedia.org/T152954) (owner: 10ArielGlenn) [13:05:30] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:06:55] (03PS2) 10ArielGlenn: list last n good dumps: add (unimplemented) option for dumping for rsyncers [puppet] - 10https://gerrit.wikimedia.org/r/326417 (https://phabricator.wikimedia.org/T152954) [13:11:33] !log Created initial bureaucrat account for Edjoerv on ec.wikimedia (T135521) [13:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:45] T135521: Internal Wiki for Wikimedians of Ecuador - https://phabricator.wikimedia.org/T135521 [13:12:26] (03PS1) 10Addshore: Revert "Disable ElectronPdfService on mw.org until messages are fixed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/326420 [13:13:07] (03PS2) 10Addshore: Enable ElectronPdfService extension on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326420 (https://phabricator.wikimedia.org/T150944) [13:13:14] (03PS3) 10Addshore: Enable ElectronPdfService extension on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326420 (https://phabricator.wikimedia.org/T150944) [13:14:46] (03PS1) 10Aude: Enable statements parser function and lua on WikibaseClient wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326421 (https://phabricator.wikimedia.org/T152780) [13:16:47] (03CR) 10Aude: [C: 032] Enable statements parser function and lua on WikibaseClient wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326421 (https://phabricator.wikimedia.org/T152780) (owner: 10Aude) [13:17:20] (03Merged) 10jenkins-bot: Enable statements parser function and lua on WikibaseClient wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326421 (https://phabricator.wikimedia.org/T152780) (owner: 10Aude) [13:22:02] checking on mwdebug [13:22:34] [= [13:25:02] it's a kitten :) [13:25:05] it works [13:25:46] (03PS1) 10ArielGlenn: list last n good dumps: implement rsynclisting option [puppet] - 10https://gerrit.wikimedia.org/r/326422 (https://phabricator.wikimedia.org/T152954) [13:27:03] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: add script to generate and sign ECDSA certificates [puppet] - 10https://gerrit.wikimedia.org/r/326418 (owner: 10Giuseppe Lavagetto) [13:27:30] !log aude@tin Synchronized wmf-config/Wikibase.php: Enable Wikibase statements parser function and lua T152780 (duration: 00m 52s) [13:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:42] T152780: deploy new statement parser function and Lua function - https://phabricator.wikimedia.org/T152780 [13:29:44] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Remove test wiki settings for statements 
parser function and lua (duration: 00m 56s) [13:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:15] !log aude@tin Synchronized wmf-config/Wikibase-labs.php: Remove beta wiki settings for statements parser function and lua (duration: 00m 47s) [13:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:07] done [13:40:05] jouncebot: next [13:40:05] In 0 hour(s) and 19 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T1400) [13:40:13] zeljkof: looks easy today :} [13:40:19] [= [13:40:29] or not [13:40:46] I looked at the list earlier today and there were way less changes [13:40:53] zeljkof: I guess we can both do it [13:41:45] the portals change can be deployed independently/in parallel [13:41:49] hashar: how do we split? [13:41:56] the first change in the list requires some maintenance script to be run [13:42:22] though of course change "910246" does not exist :D [13:44:18] zeljkof: I think you should handle the namespace alias change by Urbanecm https://gerrit.wikimedia.org/r/#/c/326246/ [13:44:32] zeljkof: requires a maintenance script to be run against the wiki. That is a good training [13:45:08] zeljkof: basically they make it so a page like  [[WP:Foobar]] is equivalent to [[Wikipedia:Foobar]] [13:45:22] and if there are any page named WP:XXXX we gotta move them [13:45:48] since currently they are in the article namespace ( NS_MAIN = 0) in the database they look like: namespace=0 page_title=WP:XXXX [13:46:04] and they have to be migrated to : namespace=NS_PROJECT page_title=XXXX [13:46:16] hashar, am I needed? [13:46:43] Urbanecm: I have updated the deploy page for the NS alias change. 
I am assuming its change https://gerrit.wikimedia.org/r/#/c/326246/ for kuwiki [13:47:24] hashar, of course, sorry for my typo :D [13:47:59] zeljkof: unless you are busy and I can take it all :} [13:48:43] hashar: well, probably both of us have other stuff to do, we can split the swat and get it done quicker [13:48:58] I'm just not sure how to do it without stepping on each other feet [13:49:55] hashar, Urbanecm: which script needs to run? https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Maintenance_Scripts [13:50:05] * zeljkof is confused [13:50:35] that would be maintenance/namespaceDupes.php [13:50:53] it find duplicates / rename article as needed [13:51:02] basically [13:51:11] (03PS1) 10Ema: Update default config file [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/326426 [13:51:22] once the namespace alias has been created ( for 'WP' => NS_PROJECT) [13:51:51] the script find all articles for which the title starts with 'WP' and will try to move it under the NS_PROJECT namespace [13:52:17] but maybe that needs a bit too much mediawiki knowledge,I don't mind dealing with it [13:52:22] so, I do the deployment as usual, then run the script? 
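The migration described above (main-namespace rows titled WP:XXXX becoming NS_PROJECT rows titled XXXX) can be sketched roughly as follows. This is an illustration only, not what namespaceDupes.php actually does internally; the function name `remap_title` and the `aliases` mapping are hypothetical, prefix matching is assumed to be exact-case, and NS_PROJECT is taken to be namespace ID 4.

```python
NS_MAIN = 0
NS_PROJECT = 4  # assumed: MediaWiki's project namespace ID


def remap_title(namespace, title, aliases):
    """Return the (namespace, title) a row should have once the alias exists.

    Hypothetical helper: `aliases` maps a prefix such as "WP" to a namespace ID.
    Rows outside the main namespace, or without a known prefix, are unchanged.
    """
    if namespace != NS_MAIN or ":" not in title:
        return namespace, title
    prefix, rest = title.split(":", 1)
    target = aliases.get(prefix)
    if target is None or not rest:
        return namespace, title
    return target, rest


aliases = {"WP": NS_PROJECT}
print(remap_title(NS_MAIN, "WP:XXXX", aliases))           # moved to NS_PROJECT
print(remap_title(NS_MAIN, "Portal:Erdnîgarî", aliases))  # unknown prefix, unchanged
```

Only rows in the main namespace are considered, which matches the "namespace=0 page_title=WP:XXXX" description in the log.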
[13:52:27] yeah [13:52:39] ok, sounds easy [13:52:46] until the script explode hehe [13:52:50] never done it before, as far as I remember [13:53:33] pagelinks from=39013 ns=0 dbk=Portal:Bûyerên_rojane -> Portal:Bûyerên_rojane DRY RUN [13:53:34] pagelinks from=19994 ns=0 dbk=Portal:Erdnîgarî -> Portal:Erdnîgarî DRY RUN [13:53:38] it is not always run apparently [13:53:48] mwscript namespaceDupes.php kuwiki --fix [13:54:02] Probably this one [13:54:04] zeljkof, [13:54:36] Urbanecm: thanks [13:54:39] and usually I copy paste the script output on the task [13:55:06] !log Fixing duplicate kuwiki articles T39521 [13:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:17] T39521: Change namespaces configuration - ku.wikipedia - https://phabricator.wikimedia.org/T39521 [13:56:52] zeljkof: example of the script output https://phabricator.wikimedia.org/T152815#2864678 [13:57:36] going to rebase them all [13:58:25] perfect timing "IRCCloud system message: We're migrating your user account to a different server to balance the system. Your account will be offline briefly before reconnecting automatically. We apologise for the disruption." 
[13:58:42] haha [13:59:19] rebase inc [13:59:37] (03PS2) 10Hashar: Alias from WP to NS_PROJECT in kuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326246 (https://phabricator.wikimedia.org/T152815) (owner: 10Urbanecm) [13:59:39] (03PS2) 10Hashar: [cirrus] enable BM25 on all but wikis with spaceless languages [step 2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324752 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [13:59:41] (03PS2) 10Hashar: Bumping Portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326408 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:59:41] I don't see anything in zuul [13:59:43] (03PS3) 10Hashar: Reconfigure interface editor group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285209 (https://phabricator.wikimedia.org/T133564) (owner: 10Dereckson) [13:59:45] (03PS2) 10Hashar: Set language links order for no.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326360 (https://phabricator.wikimedia.org/T148021) (owner: 10Dereckson) [13:59:47] (03PS2) 10Hashar: Enable Translate on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326359 (https://phabricator.wikimedia.org/T152490) (owner: 10Dereckson) [13:59:49] oh, here it is [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T1400). Please do the needful. [14:00:04] Urbanecm, dcausse, jan_drewniak, Dereckson, and Addshore: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[14:00:09] for Translate, tables have already been created [14:00:10] (03PS4) 10Hashar: Enable ElectronPdfService extension on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326420 (https://phabricator.wikimedia.org/T150944) (owner: 10Addshore) [14:00:14] I did them this mornin [14:00:14] g [14:00:16] awesome [14:00:24] that is for the no.wikimedia.org right ? [14:00:28] aye [14:00:31] o/ [14:00:32] o/ [14:00:35] *waves* [14:00:36] so that is only a slice of traffic [14:00:40] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:02] hashar: zeljkof you should be able to push mine straight out (skipping mwdebug) [14:01:11] (03CR) 10Hashar: [C: 032] "SWAT (fr:GIGN)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326359 (https://phabricator.wikimedia.org/T152490) (owner: 10Dereckson) [14:01:16] (03CR) 10Hashar: [C: 032] "SWAT (fr:GIGN)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326360 (https://phabricator.wikimedia.org/T148021) (owner: 10Dereckson) [14:01:19] o/ [14:01:20] (03CR) 10Hashar: [C: 032] "SWAT (fr:GIGN)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285209 (https://phabricator.wikimedia.org/T133564) (owner: 10Dereckson) [14:01:25] (03CR) 10Hashar: [C: 032] "SWAT (fr:GIGN)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326408 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:01:29] (03CR) 10Hashar: [C: 032] "SWAT (fr:GIGN)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324752 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [14:01:33] (03CR) 10Hashar: [C: 032] "SWAT (fr:GIGN)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326246 (https://phabricator.wikimedia.org/T152815) (owner: 10Urbanecm) [14:01:37] (03CR) 10Hashar: [C: 032] "SWAT (fr:GIGN)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326420 (https://phabricator.wikimedia.org/T150944) (owner: 10Addshore) 
[14:01:42] 06Operations: Revoke accounts (NDA Audit 2016) - https://phabricator.wikimedia.org/T152957#2864688 (10faidon) [14:01:44] hashar: should I start with 326246? it's rebased [14:01:58] eyah [14:02:05] I went made and just merged everything [14:02:10] grr [14:02:35] (03PS2) 10Ema: Update default config file [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/326426 [14:02:47] 06Operations, 06Labs: Explore hosting the multimedia commons use case - https://phabricator.wikimedia.org/T152632#2864700 (10zhuyifei1999) [[https://commons.wikimedia.org/wiki/Help:Server-side_upload#What_to_do_if_files_represent_hundred_of_GB_to_several_TB.3F|commons:Help:Server-side upload#What to do if file... [14:03:17] (03Merged) 10jenkins-bot: Alias from WP to NS_PROJECT in kuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326246 (https://phabricator.wikimedia.org/T152815) (owner: 10Urbanecm) [14:03:20] We need a kanban board to report what has been tested for the merge them all and test together workflow [14:03:38] (03Merged) 10jenkins-bot: [cirrus] enable BM25 on all but wikis with spaceless languages [step 2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324752 (https://phabricator.wikimedia.org/T152092) (owner: 10DCausse) [14:03:43] (03CR) 10Elukey: [C: 031] Update default config file [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/326426 (owner: 10Ema) [14:03:45] difficult to track cross live on mwdebug1002 / works / syncing for 8 patches [14:03:55] unless you pull them one by one on tin [14:03:57] Urbanecm: can 326246 be tested at mwdebug1002? [14:04:01] indeed [14:04:21] (03Merged) 10jenkins-bot: Bumping Portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326408 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [14:04:21] zeljkof: 326246 you can sync it up [14:04:29] It should be testable zeljkof [14:04:48] hashar: should I run "git fetch" on tin, as usual? 
[14:04:54] (03Merged) 10jenkins-bot: Reconfigure interface editor group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285209 (https://phabricator.wikimedia.org/T133564) (owner: 10Dereckson) [14:05:08] will do [14:05:17] done [14:05:17] (03Merged) 10jenkins-bot: Set language links order for no.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326360 (https://phabricator.wikimedia.org/T148021) (owner: 10Dereckson) [14:05:18] * zeljkof is confused with all patches being merged [14:05:28] you can pull on the mwdebug test [14:05:38] (03Merged) 10jenkins-bot: Enable Translate on no.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326359 (https://phabricator.wikimedia.org/T152490) (owner: 10Dereckson) [14:05:41] I rebased tin to get two patches [14:05:46] the one for namespace [14:05:49] and the one for dcausse [14:06:05] (03Merged) 10jenkins-bot: Enable ElectronPdfService extension on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326420 (https://phabricator.wikimedia.org/T150944) (owner: 10Addshore) [14:06:38] the idea is to not have to bother with CI / CR+2 etc [14:06:45] and just rebase on tin patch by path [14:06:47] zfilipin@mwdebug1002:~$ scap pull [14:07:10] dcausse: Urbanecm your changes are on mwdebug wikis :} [14:07:17] "scap pull" still running... [14:07:28] hashar: testing [14:07:36] hashar, thanks. Going to check. [14:08:11] dcausse: that shift some traffic to codfw doesn't it ? [14:08:16] 'wmgCirrusSearchDefaultCluster' => [ [14:08:16] - 'default' => 'local', [14:08:16] + 'default' => 'codfw', [14:08:31] hashar: nearly all search traffic yes [14:08:46] hashar: "scap pull" is still running, it is surprisingly slow today :| [14:08:46] * hashar watch Dallas area becoming warmer [14:09:01] zeljkof, hashar it seems it works. Can it be deployed to the whole cluster? 
[14:09:24] zeljkof: it is rebuilding the l10n files apparently [14:09:36] (03CR) 10Jcrespo: [C: 031] mariadb: Added gtid_domain_id to its own variable [puppet] - 10https://gerrit.wikimedia.org/r/326086 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [14:09:44] hashar: should I deploy to cluster? [14:09:51] or wait it to finish? [14:09:59] wait for it [14:10:08] ok, done [14:10:17] deploying to cluster [14:10:18] and wanna hold with dcausse change [14:11:02] preparing ad127cc - Bumping Portals to master [14:11:07] ok, wait, can I deploy wmf-config/InitialiseSettings.php to the cluster? [14:11:41] zeljkof: once dcausse has validated yes [14:11:47] hashar, zeljkof: mine tested on mwdebug1002 and looks good [14:11:56] ok, then, deploying [14:12:02] ;:} [14:12:21] hashar: you got me all confused :P [14:12:32] just have to scap sync-file wmf-config/InitialiseSettings.php [14:12:44] yeah sorry :( [14:12:49] could be mitigated with a kanban merged / on tin / on mwdebug1002 / live [14:12:52] should have kept the one by one workflow probably [14:12:54] and then run the script, right? 
[14:13:05] between on mwdebug1002 and live a "tested" [14:13:08] zeljkof: yes [14:14:11] jan_drewniak: your portals update should be on mwdebug1002 now [14:14:41] jan_drewniak: oh that is just some statistics :} [14:14:45] (03PS9) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [14:14:47] (03PS8) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [14:14:49] (03PS1) 10Alexandros Kosiaris: kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 [14:14:51] (03PS1) 10Alexandros Kosiaris: k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 [14:15:16] hashar: yup, nothing fancy today - looks good [14:15:34] https://www.wikipedia.org/ gives me 400 Bad Request though [14:15:52] oh no [14:15:56] that is a local issue sorry [14:15:58] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:326246|Alias from WP to NS_PROJECT in kuwiki (T152815)]] (duration: 00m 45s) [14:16:06] hashar: works for me [14:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:10] T152815: Kurdish Wikipedia: Create a redirection to the namespace {{ns: 4}} - https://phabricator.wikimedia.org/T152815 [14:16:34] scappino portals [14:16:36] dcausse: live [14:16:49] Urbanecm: deployed, running the script... [14:17:10] jan_drewniak: your change is live [14:17:14] !log hashar@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 47s) [14:17:17] hashar: thanks, looking. 
[14:17:22] (03PS2) 10Marostegui: mariadb: Added gtid_domain_id to its own variable [puppet] - 10https://gerrit.wikimedia.org/r/326086 (https://phabricator.wikimedia.org/T149418) [14:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:02] !log hashar@tin Synchronized portals: (no message) (duration: 00m 48s) [14:18:11] jan_drewniak: that purged a single url [14:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:39] Dereckson: "Reconfigure interface editor group on ur.wikipedia" is on mwdebug1002 now [14:18:43] testing [14:19:22] zeljkof, hashar: everything looks good so far, thanks! (I'll continue to monitor the cluster) [14:19:34] dcausse: awesome :-} [14:19:43] Dereckson: I have also pushed on mwdebug1002 the Set language links order for no.wikipedia [14:20:05] hashar: "Reconfigure interface editor group on ur.wikipedia" works [14:20:34] hashar: "Set language links order for no.wikipedia" works [14:20:52] syncing those [14:21:11] addshore: you and your Electrons will be last :D [14:21:17] okay! [14:21:34] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 45s) [14:21:36] not sure why, the script finished with "Oh noees" :| https://phabricator.wikimedia.org/T152815#2864753 [14:21:36] gotta enable Translate for no.wikimedia.org first [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:14] Dereckson: mwdebug1002 should have Translate for no.wikimedia.org [14:22:24] ah, "id=49093 ns=0 dbk=WP:HOTCAT *** dest title exists and --add-prefix not specified" [14:22:39] (03CR) 10Urbanecm: "As you can see at T144254 something like discussion is in progress." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [14:22:40] yeah duplicate articles [14:23:08] zeljkof: so they have a NS_MAIN article named "WP:HOTCAT" and a NS_PROJECT article named "HOTCAT" [14:23:15] hashar: yeah, Translate there, works [14:23:18] they both show up as [[WP:HOTCAT]] hence a conflict [14:23:57] the script has an option --add-suffix to rename the article automatically [14:24:03] the renames will have to be posted on the tasks [14:24:08] hashar: makes sense, should I do that? [14:24:09] and the community will figure out which one is a dupe [14:24:09] --merge gives good results too [14:24:13] oh [14:24:22] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable Translate on no.wikimedia - T152490 (duration: 00m 45s) [14:24:31] https://phabricator.wikimedia.org/T152815#2864759 [14:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:35] generally they are the same page from different eras, so merging histories makes sense [14:24:35] T152490: Enable the Translate extension on Wikimedia Norge's wiki - https://phabricator.wikimedia.org/T152490 [14:24:49] addshore: it is all yours :} [14:25:02] hashar, is 326246 deployed? [14:25:04] hashar: all done? [14:25:06] addshore: patch has been fetched on tin, you "just" have to rebase [14:25:25] cool! [14:25:34] Urbanecm: yes but the maintenance namespace dupe script still has to be run / is running [14:25:50] hashar, why is this a three-person SWAT? :) [14:26:24] parallelism ! [14:26:40] and more ways to break the whole cluster? [14:27:09] If you think it's useful...
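The WP:HOTCAT situation above (a main-namespace page whose destination title already exists in the project namespace) can be illustrated with a small dry-run planner. This is a sketch under stated assumptions, not the logic of namespaceDupes.php: `plan_moves` is a hypothetical name, pages are modelled as plain (namespace, title) tuples, and 4 is assumed to be the project namespace ID.

```python
def plan_moves(pages, alias_prefix, target_ns, main_ns=0):
    """Split aliased main-namespace pages into safe moves and conflicts.

    Hypothetical dry run: a page conflicts when its destination
    (target_ns, title without the prefix) is already taken.
    """
    existing = set(pages)
    moves, conflicts = [], []
    for ns, title in sorted(pages):
        if ns != main_ns or not title.startswith(alias_prefix + ":"):
            continue
        dest = (target_ns, title[len(alias_prefix) + 1:])
        if dest in existing:
            conflicts.append(((ns, title), dest))
        else:
            moves.append(((ns, title), dest))
    return moves, conflicts


pages = [(0, "WP:HOTCAT"), (4, "HOTCAT"), (0, "WP:AC"), (0, "Kurdistan")]
moves, conflicts = plan_moves(pages, "WP", 4)
print("moves:", moves)
print("conflicts:", conflicts)
```

Conflicting pages are the ones that would need a rename or a history merge, as discussed in the log; pages without a conflict could simply be moved.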
[14:27:21] bbl [14:28:10] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: {{gerrit|326420}} T150944 Enable ElectronPdfService extension on mw.org (duration: 00m 45s) [14:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:21] T150944: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944 [14:28:53] hashar: all good on my side! [14:29:40] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:30:40] addshore: awesome [14:30:46] zeljkof: managed to fix up the duplicates? [14:31:02] hashar: I'm not sure what to do [14:31:10] ask!!? :} [14:31:14] at the task? [14:31:22] so [14:31:34] one has to run the maintenance script to find duplicate articles [14:31:34] mwscript namespaceDupes.php --wiki=kuwiki [14:31:46] that spits out 8 links that have to be fixed [14:31:54] such as: ns=0 dbk=WP:AC -> Wîkîpediya:AC [14:32:22] If you look at https://ku.wikipedia.org/wiki/WP:AC [14:32:34] that redirects you to the canonical page https://ku.wikipedia.org/wiki/Wîkîpediya:AC [14:32:41] but the actual content is still in WP:AC [14:32:47] the redirect kind of hides the content [14:33:09] to fix them, run the script with --fix [14:33:10] mwscript namespaceDupes.php --wiki=kuwiki --fix [14:33:14] that will rename the pages [14:33:27] hashar: ok, I already did that [14:34:12] https://phabricator.wikimedia.org/T152815#2864753 [14:35:36] anything else I should do? [14:37:11] * zeljkof is confused [14:40:50] hashar: ^ [14:41:38] ahhh [14:42:36] sorry [14:42:48] I am not sure why it does not fix the 8 that are apparently fixable [14:43:18] guess it chokes on the WP:HOTCAT conflict [14:43:34] zeljkof: with --add-prefix that will rename one of the two pages [14:43:49] did it [14:44:24] hashar: swat done?
[14:45:40] 06Operations, 10MediaWiki-API, 10Parsoid, 10RESTBase, and 6 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2864816 (10GWicke) >>! In T97192#2861483, @Joe wrote: > Just for the record, the reason requests piled up in... [14:46:36] Probably easier to sort revisions, than to sort pages, use --merge in the future [14:47:04] To have several pages is more confusing, and dirtier, as we have the history split into two places [14:47:26] it will deny merge if chronological revision order doesn't seem logical [14:47:38] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2864829 (10elukey) @Cmjohnson ping :) [14:48:42] zeljkof: yes [14:49:44] hashar: party time then ;) 🎉 [14:49:58] there are still some issues with the kuwiki page links though [14:50:30] I have left comments at the task [14:50:46] can they resolve the problems manually? or do we need to run the script? [14:51:55] zeljkof, I think the pages aren't accessible. We can revert the change, move the pages manually to another namespace, revert the revert and move them back. [14:52:47] hashar: what do you think about that? ^ [14:52:54] too complicated? :D [14:53:04] I am not sure really [14:54:53] they don't seem to exist [14:55:04] there is no page with title 'AC' [14:56:59] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:57:49] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [14:59:33] !log elastic@eqiad: T152092 - reindexing all wikis (except spaceless language wikis) from terbium, logs in ~dcausse/bm25_reindex/cirrus_log [14:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:46] T152092: Activate BM25 on all but wikis with spaceless languages - https://phabricator.wikimedia.org/T152092 [15:00:03] hashar: do we do something? or not? 
[15:00:07] (03CR) 10Reedy: [C: 031] "I don't think it matters much.. Other than preventing it on the 1st January 2017 ;)" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [15:00:41] (03CR) 10Ema: [V: 032 C: 032] Update default config file [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/326426 (owner: 10Ema) [15:00:50] (03PS8) 10Reedy: Add cronjob for regenerating captchas [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) [15:02:59] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:02:59] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:03:49] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:03:50] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [15:05:28] zeljkof: so done yes [15:05:48] hashar: all done? [15:06:55] yes [15:06:58] \O/ [15:13:55] 06Operations, 13Patch-For-Review: python-confluent-kafka conflict with snakebite on stat1002 - https://phabricator.wikimedia.org/T152771#2864858 (10Ottomata) Thanks Volans! Will try to fix these packages today (or at least one of them). [15:16:47] !log varnishkafka 1.0.12-2 uploaded to carbon [15:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:08] (03PS10) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [15:24:09] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2864889 (10Gehel) In the latest cluster restart, we did manage to restart a few nodes in < 3 minutes when writes are disabled... 
[15:24:10] (03PS9) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [15:24:13] (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 [15:24:30] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [15:24:48] (03CR) 10Marostegui: [C: 032] mariadb: Added gtid_domain_id to its own variable [puppet] - 10https://gerrit.wikimedia.org/r/326086 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [15:26:31] (03CR) 10Mark Bergsma: "> This needs to be done carefully and surely depend on the context:" [puppet] - 10https://gerrit.wikimedia.org/r/326144 (https://phabricator.wikimedia.org/T97192) (owner: 10Mark Bergsma) [15:26:51] (03PS2) 10BBlack: tlsproxy: remove unused keepalives complexity [puppet] - 10https://gerrit.wikimedia.org/r/323865 (https://phabricator.wikimedia.org/T107749) [15:27:02] (03CR) 10BBlack: [V: 032 C: 032] tlsproxy: remove unused keepalives complexity [puppet] - 10https://gerrit.wikimedia.org/r/323865 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [15:27:13] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2864911 (10Ottomata) We could set up a special varnishkafka instance for this, if that makes sense. But, hm, I think using kafkatee would be better! kafkatee supports piped output, so we don't ha... 
[15:27:17] (03PS2) 10BBlack: tlsproxy: be explicit about Conn:close [puppet] - 10https://gerrit.wikimedia.org/r/323866 (https://phabricator.wikimedia.org/T107749) [15:27:23] (03CR) 10BBlack: [V: 032 C: 032] tlsproxy: be explicit about Conn:close [puppet] - 10https://gerrit.wikimedia.org/r/323866 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [15:28:28] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2864916 (10Ottomata) Hm, actually, it might even be nicer to feed 5xx logs back into a dedicated topic in kafka. If we did that, then we could use logstash's Kafka importer to consume the full 5xx... [15:29:21] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlogging03 has puppet failure due to missing class - https://phabricator.wikimedia.org/T152842#2864920 (10Ottomata) Just asked this on the ticket, will re-ask here: Can I remove this role inclusion somehow? I'm looking in horizon, but I don't see the ro... 
[15:30:00] (03PS1) 10BBlack: normalize host header a little better [puppet] - 10https://gerrit.wikimedia.org/r/326443 [15:31:39] 06Operations, 10vm-requests: Site: (1) VM request for kubernetes - https://phabricator.wikimedia.org/T152966#2864924 (10akosiaris) [15:32:05] (03CR) 10BBlack: [C: 032] normalize host header a little better [puppet] - 10https://gerrit.wikimedia.org/r/326443 (owner: 10BBlack) [15:32:08] 06Operations, 10vm-requests: Site: 2 VM request for kubernetes - https://phabricator.wikimedia.org/T152966#2864936 (10akosiaris) p:05Triage>03Normal a:03akosiaris [15:32:52] (03PS2) 10Giuseppe Lavagetto: puppetmaster: add script to generate and sign ECDSA certificates [puppet] - 10https://gerrit.wikimedia.org/r/326418 [15:34:34] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: add script to generate and sign ECDSA certificates [puppet] - 10https://gerrit.wikimedia.org/r/326418 (owner: 10Giuseppe Lavagetto) [15:35:14] (03PS12) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 (https://phabricator.wikimedia.org/T110717) [15:35:16] (03PS3) 10BBlack: Varnish: remove "varnish-be-rand" conftool service [puppet] - 10https://gerrit.wikimedia.org/r/325798 (https://phabricator.wikimedia.org/T110717) [15:35:18] (03PS15) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [15:35:20] (03PS15) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [15:35:22] (03PS13) 10BBlack: cache_misc req_handling: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [15:35:24] (03PS13) 10BBlack: cache_misc req_handling: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (https://phabricator.wikimedia.org/T110717) [15:35:59] 06Operations, 10Prod-Kubernetes, 
10vm-requests, 07kubernetes: Site: 2 VM request for kubernetes - https://phabricator.wikimedia.org/T152966#2864944 (10akosiaris) [15:36:57] (03PS1) 10Alexandros Kosiaris: Introduce argon and chlorine [dns] - 10https://gerrit.wikimedia.org/r/326445 (https://phabricator.wikimedia.org/T152966) [15:37:20] (03PS1) 10Marostegui: mariadb: Enable gtid_domain_id - phabricator hosts [puppet] - 10https://gerrit.wikimedia.org/r/326446 (https://phabricator.wikimedia.org/T149418) [15:39:09] (03PS1) 10Ottomata: Remove webrequest_bits config from kafkatee input [puppet] - 10https://gerrit.wikimedia.org/r/326448 [15:39:18] (03PS2) 10Ottomata: Remove webrequest_bits config from kafkatee input [puppet] - 10https://gerrit.wikimedia.org/r/326448 [15:39:33] (03CR) 10Ottomata: [V: 032 C: 032] Remove webrequest_bits config from kafkatee input [puppet] - 10https://gerrit.wikimedia.org/r/326448 (owner: 10Ottomata) [15:41:09] PROBLEM - Host elastic2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:34] looking ^ [15:43:08] it probably is down indeed [15:43:15] (03PS3) 10Giuseppe Lavagetto: puppetmaster: add script to generate and sign ECDSA certificates [puppet] - 10https://gerrit.wikimedia.org/r/326418 [15:43:18] unpingeable by a host next to it [15:43:46] it happened last time we switched traffic to codfw, not sure it's the same host though [15:45:08] yes same host: T149006 [15:45:08] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [15:45:15] "The server is not powered on." 
[15:45:15] that probably explains it [15:45:33] I suspect some hw instabilities on this one :/ [15:45:54] (03PS14) 10BBlack: cache_misc req_handling: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (https://phabricator.wikimedia.org/T110717) [15:46:26] the host went down few hours after the switch, and we switched to codfw during eu swat, so exactly the same configuration [15:46:47] date=12/12/2016 [15:46:47] time=07:38 [15:46:48] description=Server power removed. [15:47:09] description=Embedded Flash/SD-CARD: Restarted. [15:47:26] nothing else though [15:47:54] it's very suspicious that this host failed again just after we switched traffic to codfw [15:48:01] !log mobrovac@tin Starting deploy [changeprop/deploy@05f4e5d]: (no message) [15:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:52] the server clearly believes the power was removed ... [15:48:53] !log mobrovac@tin Finished deploy [changeprop/deploy@05f4e5d]: (no message) (duration: 00m 52s) [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:07] let's see if I can power it up [15:49:16] akosiaris: thanks [15:49:49] nope.. it's refusing [15:49:59] ok so exactly the same issue [15:52:49] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:53:07] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2864965 (10dcausse) 05Resolved>03Open Reopening, this host went down today few hours after we switched all search traffic to codfw. @akosia... [15:53:44] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2864967 (10akosiaris) Reopening The server is exhibiting the exact same symptoms. 
It reports it was powered off by power removal ``` hpiLO... [15:54:45] (03PS2) 10Alexandros Kosiaris: kube-scheduler: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326429 [15:54:47] (03PS2) 10Alexandros Kosiaris: k8s::controller: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326430 [15:54:49] (03PS2) 10Alexandros Kosiaris: k8s::apiserver: Amend to support more than labs [puppet] - 10https://gerrit.wikimedia.org/r/326441 [15:54:51] (03PS11) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [15:54:53] (03PS10) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [15:54:55] (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Remove unused master_host parameter [puppet] - 10https://gerrit.wikimedia.org/r/326452 [15:55:56] (03PS2) 10Volans: icinga: raid_handler improvements [puppet] - 10https://gerrit.wikimedia.org/r/321642 (https://phabricator.wikimedia.org/T149913) [15:56:29] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] [15:57:29] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is OK: OK: Less than 20.00% above the threshold [1200.0] [15:57:32] (03PS4) 10Giuseppe Lavagetto: puppetmaster: add script to generate and sign ECDSA certificates [puppet] - 10https://gerrit.wikimedia.org/r/326418 [15:58:22] latency is likely due to re-shuffling shards and should sort itself [16:01:29] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [2000.0] [16:02:30] (03CR) 10Madhuvishy: [C: 031] labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - 10https://gerrit.wikimedia.org/r/325949 (owner: 10Rush) [16:02:47] ebernhardson: yes, 
and the morelike cache is not fully populated we just switched 2 hours ago [16:04:02] <_joe_> bbiab [16:08:26] dcausse: interestingly, the spike doesn't look limited to more like in the elasticsearch-percentiles dashboard, it's across the board [16:08:39] except comp suggest [16:08:51] yes this one is still on eqiad [16:09:13] oh, of course :) [16:10:29] doh, yes fulltext latencies are very bad :/ [16:12:09] (03CR) 10BryanDavis: [C: 031] "I've put this patch up for Puppet SWAT on 2016-12-13" [puppet] - 10https://gerrit.wikimedia.org/r/313035 (owner: 10Anomie) [16:12:49] (03PS1) 10Ottomata: Add debian patch to remove install usr/LICENSE [debs/python-confluent-kafka] (debian) - 10https://gerrit.wikimedia.org/r/326456 (https://phabricator.wikimedia.org/T152771) [16:17:29] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2865073 (10Aklapper) @AnnaMariaKoshka: Feel free to create a separate task with more informa... 
[16:18:38] (03PS1) 10Ottomata: Revert "Revert "Add python-confluent-kafka to eventlogging::dependencies"" [puppet] - 10https://gerrit.wikimedia.org/r/326458 [16:20:12] (03PS13) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 (https://phabricator.wikimedia.org/T110717) [16:20:14] (03PS4) 10BBlack: Varnish: remove "varnish-be-rand" conftool service [puppet] - 10https://gerrit.wikimedia.org/r/325798 (https://phabricator.wikimedia.org/T110717) [16:20:16] (03PS16) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [16:20:18] (03PS16) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [16:20:20] (03PS14) 10BBlack: cache_misc req_handling: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [16:21:26] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add script to generate and sign ECDSA certificates [puppet] - 10https://gerrit.wikimedia.org/r/326418 (owner: 10Giuseppe Lavagetto) [16:21:40] !log puppet disabled on cache nodes for VCL work [16:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:27] <_joe_> bblack: my change above allows to manage ECDSA keys with the puppet CA [16:22:32] (03PS3) 10Reedy: Remove pear php-mail related packages [puppet] - 10https://gerrit.wikimedia.org/r/256119 [16:23:32] (03CR) 10jenkins-bot: [V: 04-1] Remove pear php-mail related packages [puppet] - 10https://gerrit.wikimedia.org/r/256119 (owner: 10Reedy) [16:23:38] (03PS1) 10Niharika29: Deploy scholarships with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) [16:23:44] _joe_: \o/ [16:23:50] _joe_, there is no plan to substitute existing ones instantly, right? 
[16:23:55] <_joe_> bblack: nope [16:24:12] (03PS2) 10Ottomata: Revert "Revert "Add python-confluent-kafka to eventlogging::dependencies"" [puppet] - 10https://gerrit.wikimedia.org/r/326458 [16:24:23] is that an answer to me? [16:24:30] this is for generating additional certs I think, not modifying the one the puppet client uses [16:24:34] <_joe_> jynus: yes [16:24:38] just happens to use the puppet CA for signing [16:24:38] <_joe_> sorry :P [16:24:43] that is good [16:25:03] did we ever get through the switch from 4k->2k for puppetca? I lost track [16:25:11] <_joe_> jynus: this is, for now, intended for any service that needs a cert shared across a cluster and wants ECDSA [16:25:21] <_joe_> bblack: still nope [16:26:11] (03PS4) 10Reedy: Remove pear php-mail related packages [puppet] - 10https://gerrit.wikimedia.org/r/256119 [16:26:12] _joe_: I think you didn't click Submit [16:26:15] 06Operations, 10Traffic, 13Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2865079 (10ema) With python-varnishapi, varnishlog.py and friends we're essentially solving the (already solved) problem of reading from VSM, and we're not doing...
[16:26:15] actually, I am worrying unnecessarily: my pains were when I changed the CA, not the key/certs themselves [16:26:17] maybe [16:26:24] <_joe_> bblack: I did, it's merged [16:26:33] oh no, it's the gerrit rev# in the link on my end [16:26:37] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Revert "Add python-confluent-kafka to eventlogging::dependencies"" [puppet] - 10https://gerrit.wikimedia.org/r/326458 (owner: 10Ottomata) [16:26:46] (03Abandoned) 10Reedy: Remove pear php-mail related packages [puppet] - 10https://gerrit.wikimedia.org/r/256119 (owner: 10Reedy) [16:26:46] I hate that I click on links with /13 and then it's ignoring /14 till I fix it [16:27:04] (03PS14) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 (https://phabricator.wikimedia.org/T110717) [16:27:09] (03CR) 10BBlack: [V: 032 C: 032] VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [16:28:43] 06Operations, 10Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#2865086 (10faidon) [16:31:14] (03CR) 10jenkins-bot: [V: 04-1] Deploy scholarships with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [16:35:03] _joe_: I forget, is there some magic to completely removing a Service from conftool-data at runtime? like, all per-node entries for that service must be depooled before the sync will delete it or something? or remove the service from the nodes in conftool-data in a separate preceding commit from removing the service itself?
[16:35:28] <_joe_> bblack: the latter IIRC [16:35:42] <_joe_> I mean removing everything together should work [16:38:12] (03PS2) 10Niharika29: Deploy scholarships with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) [16:39:29] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1001 is OK: OK: Less than 20.00% above the threshold [1200.0] [16:40:15] (03PS5) 10BBlack: Varnish: remove "varnish-be-rand" conftool service 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/325798 (https://phabricator.wikimedia.org/T110717) [16:40:17] (03PS17) 10BBlack: cache_misc app_directors/req_handling split [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) [16:40:19] (03PS17) 10BBlack: cache_misc req_handling: sort entries [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) [16:40:21] (03PS15) 10BBlack: cache_misc req_handling: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) [16:40:23] (03PS16) 10BBlack: cache_misc req_handling: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (https://phabricator.wikimedia.org/T110717) [16:40:25] (03PS1) 10BBlack: Varnish: remove "varnish-be-rand" conftool service 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/326471 (https://phabricator.wikimedia.org/T110717) [16:41:32] (03PS3) 10Jcrespo: mariadb: puppetize misc-dumps cron, which was missing [puppet] - 10https://gerrit.wikimedia.org/r/325760 [16:47:29] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [16:48:09] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 25 probes of 402 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [16:48:18] !log bblack@puppetmaster1001 conftool action : set/pooled=no; selector: service=varnish-be-rand [16:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:16] (03CR) 10BBlack: [C: 032] Varnish: remove "varnish-be-rand" conftool service 1/2 [puppet] - 10https://gerrit.wikimedia.org/r/325798 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [16:49:39] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 66 probes of 405 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:50:46] (03Abandoned) 10Papaul: DNS: Add mgmt DNS for ms-fe200[5-8] Bug:T152612 [dns] - 10https://gerrit.wikimedia.org/r/326165 (https://phabricator.wikimedia.org/T152612) (owner: 10Papaul) [16:52:18] (03PS1) 10Papaul: DNS: Add mgmt DNS entries for ms-fe200[5-8] Bug:T152612 [dns] - 10https://gerrit.wikimedia.org/r/326474 (https://phabricator.wikimedia.org/T152612) [16:53:09] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 0 probes of 402 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [16:54:39] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 405 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:10:57] (03CR) 10BryanDavis: [C: 04-1] Deploy scholarships with scap3 (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [17:16:32] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:17:16] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - 
https://phabricator.wikimedia.org/T150943#2865279 (10Addshore) [17:17:19] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to testwikis and mediawikiwiki - https://phabricator.wikimedia.org/T150944#2865278 (10Addshore) 05Open>03Resolved [17:18:36] (03CR) 10BBlack: [C: 032] Varnish: remove "varnish-be-rand" conftool service 2/2 [puppet] - 10https://gerrit.wikimedia.org/r/326471 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [17:24:45] 06Operations, 06Operations-Software-Development: conftool service removal bugs - https://phabricator.wikimedia.org/T152977#2865308 (10BBlack) [17:50:02] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:02] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:51:02] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [17:51:52] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [17:57:20] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: puppet failure on deployment-phab01 ... is not a Hash. It looks to be a Array at /etc/puppet/modules/phabricator/manifests/init.pp:68 - https://phabricator.wikimedia.org/T147818#2865451 (10mmodell) `deployment-phab01` is [[ http://beta-phab.wmflab... [17:58:58] 06Operations, 10Electron-PDFs, 06TCB-Team, 15User-Addshore, 03WMDE-QWERTY-Team-Board: liuge - https://phabricator.wikimedia.org/T152985#2865455 (10Liugev6) [18:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T1800). 
[18:02:23] !log Shutting down db2034 for maintenance - T149553 [18:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:36] T149553: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553 [18:02:42] SMalyshev: can you confirm no deployment on WDQS today? [18:05:15] (03CR) 10Andrew Bogott: [C: 031] Set SYS_UID_MAX and SYS_GID_MAX to 499 [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) (owner: 10Tim Landscheidt) [18:06:12] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 1921 MB (7% inode=90%) [18:06:22] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlogging03 has puppet failure due to missing class - https://phabricator.wikimedia.org/T152842#2865484 (10Krenair) I think that has come up before in T152472 [18:06:57] disk space on elastic1030 is probably me (reindex in progress) [18:08:22] ah no it's on / (looking) [18:13:08] deploying graphoid service, should not to conflict with gehel wdqs stuff. [18:13:22] RECOVERY - Host elastic2020 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [18:13:29] dcausse: it might be trace logs still active (should not be I'll check) [18:13:49] !log truncating production-search-eqiad.log on elastic1030 [18:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:20] gehel: it's the same issue we had few months ago (trying to find it) [18:15:12] RECOVERY - Disk space on elastic1030 is OK: DISK OK [18:15:59] gehel: https://github.com/elastic/elasticsearch/issues/19187 [18:16:45] dcausse: we were blocked on the 2.4 upgrade, do you remember if this has been resolved? [18:16:51] gwicke: hi [18:17:20] gehel: yes, I think they fixed elastic. 
and cirrus should no longer create invalid indices [18:18:14] !log yurik@tin Starting deploy [graphoid/deploy@e71f316]: (no message) [18:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:43] (03PS1) 10Ottomata: Install kafkacat on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/326484 [18:19:00] !log yurik@tin Finished deploy [graphoid/deploy@e71f316]: (no message) (duration: 00m 47s) [18:19:02] dcausse: so we could upgrade to 2.4.2? [18:19:10] gehel: in theory yes :) [18:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:39] dcausse: ok, we'll need to try that one! But might be a bit short before the freeze. Let's at least try to document this... [18:19:50] gehel: makes sense [18:20:34] (03CR) 10Ottomata: [C: 032] Install kafkacat on Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/326484 (owner: 10Ottomata) [18:23:13] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlogging03 has puppet failure due to missing class - https://phabricator.wikimedia.org/T152842#2861773 (10Andrew) In the horizon gui, when I click on the 'all' filter, I see the role right there. 'Remove Role' should do what you want. [18:26:00] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlogging03 has puppet failure due to missing class - https://phabricator.wikimedia.org/T152842#2865570 (10Krenair) I just changed to the 'all' tab (thanks @andrew) and found the old classes, then removed them. I think it broke things: Notice: /Stage[main]... [18:30:04] kaldari: Dear anthropoid, the time has come. Please deploy Running populateLocalAndGlobalIds.php maintenance script (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T1830). 
[18:32:29] !log foreachwiki extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php [18:32:39] (03PS1) 10Ottomata: Remove no longer needed statistics::migration role [puppet] - 10https://gerrit.wikimedia.org/r/326489 [18:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:27] (03PS3) 10Niharika29: Deploy scholarships with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) [18:33:32] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:34:46] !log yurik@tin Starting deploy [kartotherian/deploy@9958ab2]: (no message) [18:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:08] (03PS4) 10Jcrespo: mariadb: puppetize misc-dumps cron, which was missing [puppet] - 10https://gerrit.wikimedia.org/r/325760 [18:35:45] (03PS1) 10Ottomata: s/stat1001/throrium [puppet] - 10https://gerrit.wikimedia.org/r/326490 (https://phabricator.wikimedia.org/T149438) [18:36:44] !log yurik@tin Finished deploy [kartotherian/deploy@9958ab2]: (no message) (duration: 01m 58s) [18:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:18] (03CR) 10Nuria: "Do we need to enable permits to that host for uses that have ssh access now to stat1001?" [puppet] - 10https://gerrit.wikimedia.org/r/326490 (https://phabricator.wikimedia.org/T149438) (owner: 10Ottomata) [18:38:45] kaldari: how much that script will take to finish? [18:38:59] MarcoA: about 15 hours [18:39:42] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:39:52] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:40:07] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlogging03 has puppet failure due to missing class - https://phabricator.wikimedia.org/T152842#2865632 (10Ottomata) Ah, ALL filter, duh. Hm, ok, yeah we need to have the newly refactored eventlogging roles included. I don't see them in the list of class... [18:40:18] kaldari: and after that renaming can continue or it's still not everything done? [18:41:08] after that, we'll need to turn renaming back on, which requires a config deployment, so it will probably happen tomorrow. But nothing else is blocking it. [18:41:32] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [18:41:42] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:41:49] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlogging03 has puppet failure due to missing class - https://phabricator.wikimedia.org/T152842#2865636 (10Ottomata) Actually, I take it back! Other classes worked great. [18:42:38] (03PS1) 10Jcrespo: mariadb: Depool db1051 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326493 (https://phabricator.wikimedia.org/T69223) [18:42:52] (03CR) 10Jcrespo: [C: 032] mariadb: puppetize misc-dumps cron, which was missing [puppet] - 10https://gerrit.wikimedia.org/r/325760 (owner: 10Jcrespo) [18:43:42] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:42] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:32] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:47:42] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:47:52] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:32] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [18:48:33] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:48:33] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [18:48:42] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:49:17] was that load issues? [18:49:32] yeah I'm looking into it now [18:49:32] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:51:32] load is high, so is swap [18:51:47] disk, is high, too [18:52:46] jouncebot: next [18:52:46] In 0 hour(s) and 7 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T1900) [18:53:06] * MarcoA rushes [18:54:25] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: delete unused kartotherian marker metrics - https://phabricator.wikimedia.org/T150353#2865680 (10Yurik) Kartotherian should no longer send different marker metrics. Only `kartotherian.marker` and `kartotherian.err.marker.*`are the only ones being pub... [18:54:42] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:43] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:54:47] (03PS3) 10Paladox: Gerrit: Install filebeat on gerrit server [puppet] - 10https://gerrit.wikimedia.org/r/326374 (https://phabricator.wikimedia.org/T141324) [18:55:01] jynus: yeah I think it is requests for big files that do that [18:55:18] either that needs work at app level [18:55:27] or that needs more resources [18:55:33] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [18:55:33] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:55:34] 99% of the times is app [18:56:01] yeah in this case too, I'm following up on the relevant tasks [18:56:09] if that service goes down, no new thumbnails? [18:56:40] ATM no, it is shadowing production traffic, mediawiki still serves the users [18:56:45] good [18:56:57] (03CR) 10Urbanecm: [C: 031] "Fine for me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306443 (https://phabricator.wikimedia.org/T143789) (owner: 10MarcoAurelio) [18:57:13] yeah I'd be much more worried if that wasn't the case [18:57:18] yeah [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T1900). [19:00:04] matt_flaschen, hoo, Niharika, Addshore, and MatmaRex: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [19:00:29] hi [19:00:31] o/ [19:00:37] hi [19:01:10] o/ [19:01:18] Present [19:01:21] (03CR) 10MarcoAurelio: [C: 04-1] "Per task. Meta-Wiki seems to have not being asked before. Sorry :-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326235 (https://phabricator.wikimedia.org/T152656) (owner: 10Mattflaschen) [19:01:31] I can SWAT. 
[19:01:32] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:02:04] (03PS3) 10Thcipriani: Enable GuidedTour on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326235 (https://phabricator.wikimedia.org/T152656) (owner: 10Mattflaschen) [19:02:20] thcipriani: that patch set has a -1 [19:02:30] by me, I can take the blame [19:02:58] ah, I see that just now [19:03:14] ^ matt_flaschen you all can discuss [19:04:03] (03PS2) 10Thcipriani: Load the property order from Wikidata per default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326133 (https://phabricator.wikimedia.org/T149540) (owner: 10Hoo man) [19:04:18] I'm sorry for being a "jerk" with that, but the unending discussions at Meta later are far worse [19:04:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326133 (https://phabricator.wikimedia.org/T149540) (owner: 10Hoo man) [19:05:14] (03Merged) 10jenkins-bot: Load the property order from Wikidata per default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326133 (https://phabricator.wikimedia.org/T149540) (owner: 10Hoo man) [19:05:18] MarcoA, I don't think this requires community consensus. It doesn't impact the user experience unless they choose to use the tour. [19:05:24] MarcoA, I'll follow up on the task. [19:05:28] ^ James_F [19:06:00] MarcoA, I understand you're not comfortable with us SWATting it out this window, we'll discuss on the task first and figure out what to do in the future. [19:06:46] matt_flaschen: I don't want to block foundation's work, so if that is an "official" change I think I'll let it go through [19:06:46] hoo: your patch is live on mwdebug1002, check please [19:07:06] thcipriani: ok [19:07:16] MarcoA: It is (for the WRC team); I've asked María to respond. 
[19:07:40] (03PS2) 10Thcipriani: Convert wikis to numerical sorting and uca collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323834 (https://phabricator.wikimedia.org/T149002) (owner: 10Niharika29) [19:08:04] thcipriani, is it ok if i scap3 kartotherian service in parallel? I accidentally got an older version published (not a biggy, but annoying) [19:08:50] Doh, tested on the wrong wiki [19:08:56] need a few more moments [19:09:01] hoo: np [19:09:30] yurik: Service deployment window is in < 1hr if it can wait. [19:09:48] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#1044685 (10Johan) I changed the email address on one of my accounts (one I use when I give p... [19:09:51] thcipriani, absolutely, will wait. I will add a patch for SWAT if ok :) [19:10:07] yurik: go for it :) [19:10:09] thcipriani: Confirmed, works [19:10:13] tested w/ dewiki [19:10:19] hoo: ok, going live everywhere. [19:10:52] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [19:11:45] MarcoA, noted, thanks. 
[19:11:56] !log thcipriani@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:326133|Load the property order from Wikidata per default]] T149540 (duration: 00m 45s) [19:12:01] ^ hoo live everywhere [19:12:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323834 (https://phabricator.wikimedia.org/T149002) (owner: 10Niharika29) [19:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:10] T149540: Load the property order from Wikidata on Wikimedia wikis - https://phabricator.wikimedia.org/T149540 [19:12:21] thcipriani, I'm ready to test the metawiki change if you deploy it. James_F, is María responding right now or later? [19:12:27] thcipriani: thanks [19:12:40] (03Merged) 10jenkins-bot: Convert wikis to numerical sorting and uca collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/323834 (https://phabricator.wikimedia.org/T149002) (owner: 10Niharika29) [19:12:44] matt_flaschen: sorry, I'm dealing with a hundreds of things now. Can be merged with -1 though [19:12:49] matt_flaschen: Probably later. [19:13:02] matt_flaschen: sure. I cant get it out after the uca patch. [19:13:08] *can :) [19:14:35] Niharika: I've pulled the uca patch to mwdebug1002, anything to check there before maintenance scripts run? [19:14:56] thcipriani: Nothing broken. [19:15:01] Good to go live. [19:15:06] Niharika: ok, going live. [19:15:32] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:16:14] (03PS4) 10Thcipriani: Enable GuidedTour on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326235 (https://phabricator.wikimedia.org/T152656) (owner: 10Mattflaschen) [19:16:19] MarcoA: matt_flaschen: we're going to deploy guided tour everywhere [19:16:25] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1051 for schema change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326493 (https://phabricator.wikimedia.org/T69223) (owner: 10Jcrespo) [19:16:32] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:16:40] There is still time to discuss the change, as it needs to be configured in MediaWiki: space. [19:16:43] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:323834|Convert wikis to numerical sorting and uca collation]] (T149002) (duration: 00m 45s) [19:16:53] 06Operations, 06Performance-Team, 10Thumbor: Thumbor resource consumption is spiky - https://phabricator.wikimedia.org/T151851#2865764 (10fgiunchedi) The spikes in load average on thumbor machines keep happening, it correlates with high iowait too. The iowait doesn't seem related to swap activity, rather to... [19:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:56] T149002: Convert wikis to numerical sorting batch #4 - https://phabricator.wikimedia.org/T149002 [19:17:03] ^ Niharika change is live everywhere. Will you be running maintenance scripts? [19:17:14] thcipriani: I will. Thanks. [19:17:19] Niharika: thank you! 
[19:17:23] matt_flaschen: for meta., the ideal discussion/notification venue seems to be https://meta.wikimedia.org/wiki/Meta:Babel [19:17:26] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326235 (https://phabricator.wikimedia.org/T152656) (owner: 10Mattflaschen) [19:17:32] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:18:02] (03Merged) 10jenkins-bot: Enable GuidedTour on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326235 (https://phabricator.wikimedia.org/T152656) (owner: 10Mattflaschen) [19:18:04] Dereckson: okay, with reservations though [19:18:25] !log mobrovac@tin Starting deploy [changeprop/deploy@84f162c]: (no message) [19:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:58] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2865773 (10MaxSem) [19:19:09] 06Operations, 10Electron-PDFs, 06TCB-Team, 13Patch-For-Review, and 2 others: Deploy ElectronPdfService Extension to metawiki - https://phabricator.wikimedia.org/T150943#2802094 (10MaxSem) [19:19:49] !log mobrovac@tin Finished deploy [changeprop/deploy@84f162c]: (no message) (duration: 01m 24s) [19:19:52] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:09] (03PS3) 10Filippo Giunchedi: Revert "Revert "RESTBase configuration for fi.wikivoyage.org"" [puppet] - 10https://gerrit.wikimedia.org/r/324766 (https://phabricator.wikimedia.org/T151570) (owner: 10Alex Monk) [19:20:09] matt_flaschen: metawiki tour change is live on mwdebug1002, check please [19:20:15] MarcoA: Dereckson ^ [19:20:39] thcipriani: I can't test it, I'm busy with other things sorry :( [19:20:53] MarcoA: no problem. Just an FYI for you. [19:20:54] thcipriani: Any change we could also hijack https://gerrit.wikimedia.org/r/325962 into SWAT? [19:21:04] @marostegui: Started the centralAuth populating script again. Lag predictably spiked for db1028, but also db1069. Is db1069 also set to priority 0? [19:21:23] I'll test it. [19:21:25] I've stopped the script for now. [19:21:44] Dereckson, yeah, we were planning to do that in the new year: T152827 [19:21:45] T152827: Enable GuidedTour on all wikis - https://phabricator.wikimedia.org/T152827 [19:21:51] Dereckson, did you hear about something other than that? [19:21:52] hoo: add it to the calendar, I'll try to get to it. I think it should be fine. [19:22:27] @jynus: Started the centralAuth populating script again. Lag predictably spiked for db1028, but also db1069. Is db1069 also set to priority 0? I've stopped the script for now. [19:22:36] thcipriani: Great, thanks [19:24:26] db1069 is not part of core production, kaldari [19:24:29] it will not page [19:24:40] go on [19:24:44] with the script [19:24:47] @jynus: OK, I won't worry about it then :) [19:25:33] ostriches: submodule bumps still not a thing? [19:25:34] thcipriani, not working. 
[19:26:01] * thcipriani doublechecks mwdebug [19:26:06] kaldari, use this as a guide: https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag [19:26:27] thcipriani: Should be fixed for extensions and skins meta-repos, just haven't sorted the wmf branches yet for core. [19:26:57] @jynus: What is 1069 for? Labs? [19:27:09] it is part of the production filtering [19:27:16] (03PS4) 10Hoo man: Revert "Add Abenaki language (abe) to Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) (owner: 10Thiemo Mättig (WMDE)) [19:27:18] (03PS1) 10Hoo man: Add comment about $wmgExtraLanguageNames['wikidata'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326510 [19:27:19] matt_flaschen: no that's what I had in mind, I noted on the task I think it's a good idea, as nothing change on the UI as long as it's not configured [19:27:27] (03PS3) 10Gehel: Add configs for LDF server [puppet] - 10https://gerrit.wikimedia.org/r/317282 (https://phabricator.wikimedia.org/T136358) (owner: 10Smalyshev) [19:27:33] there is not much we can do about it, assuming this will only take a few hours [19:27:34] matt_flaschen: hrm. Change is definitely in place on mwdebug1002 :\ [19:28:03] log that db1028 can get behind so that other ops are aware, if you can [19:28:54] that and labs [19:29:10] thcipriani, first it wasn't working on Meta at all, then I got an error, but I know the cause of that now. [19:30:15] thcipriani: Added the changes to wikitech [19:30:32] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:32:08] addshore: could you take a look/+1 this change to bump the electronpdfservice submodule on core for your change: https://gerrit.wikimedia.org/r/#/c/326520/ [19:32:12] hoo: thank you! 
[19:32:21] (03CR) 10Dereckson: [C: 031] Add comment about $wmgExtraLanguageNames['wikidata'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326510 (owner: 10Hoo man) [19:32:37] looking [19:33:29] thcipriani: +1ed [19:33:35] addshore: thanks :) [19:34:36] (03PS2) 10Filippo Giunchedi: visualdiff: Install uprightdiff package [puppet] - 10https://gerrit.wikimedia.org/r/326053 (owner: 10Legoktm) [19:35:28] thcipriani, I fixed that issue. You can enable it for real now. [19:35:47] matt_flaschen: ok, thanks! going live everywhere. [19:37:30] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:326235|Enable GuidedTour on metawiki]] T152656 (duration: 00m 47s) [19:37:37] ^ matt_flaschen live everywhere [19:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:44] T152656: Install GuidedTour extension on Meta - https://phabricator.wikimedia.org/T152656 [19:37:52] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:39:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) (owner: 10Thiemo Mättig (WMDE)) [19:39:12] (03PS6) 10Andrew Bogott: Add clientlib.pp and mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/325830 (https://phabricator.wikimedia.org/T150092) [19:39:45] (03Merged) 10jenkins-bot: Revert "Add Abenaki language (abe) to Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) (owner: 10Thiemo Mättig (WMDE)) [19:40:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326510 (owner: 10Hoo man) [19:40:54] (03Merged) 10jenkins-bot: Add comment about $wmgExtraLanguageNames['wikidata'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326510 (owner: 10Hoo man) [19:41:05] hoo: your revert is live on mwdebug1002, check 
please [19:41:21] thcipriani: Will try [19:41:56] ... [19:42:27] hoo: the new one in Wikidata code hasn't been put in production yet [19:42:38] Dereckson: That should be fine [19:43:20] Any dev to possibly fix something? [19:43:32] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:43:34] https://www.wikidata.org/wiki/Special:Contributions/Fralambert [19:44:07] yes indeed, the user who requested abe is busy to other things [19:44:12] https://quarry.wmflabs.org/query/14059 <- that last entry, it’s ‘unfixable’, it’s a file that was deleted because it’s ‘corrupted’ and not an audio file. [19:44:27] (03CR) 10Gehel: [C: 032] Add configs for LDF server [puppet] - 10https://gerrit.wikimedia.org/r/317282 (https://phabricator.wikimedia.org/T136358) (owner: 10Smalyshev) [19:44:55] The entry in the transcode table just ‘needs’ (not that it’s important, really) to go poof. [19:45:59] Dereckson: If it's very important we can properly deploy it this week [19:46:09] but people already started to add lables in that lang [19:46:14] (I removed them) [19:46:23] That's a problem [19:46:28] you can't even remove them afterwards [19:46:36] at least not easily [19:46:45] Dereckson: thcipriani: The change works for me [19:46:54] If Dereckson has objections, we can postpone [19:47:02] but I don't see a reason why [19:47:07] it's not urgent [19:47:53] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:48:06] ^ Dereckson up to you, could deploy or revert [19:48:27] thcipriani: go ahead [19:48:37] to get it cleanly is preferable [19:48:39] ok, going live [19:49:03] (03CR) 10Andrew Bogott: [C: 032] Add clientlib.pp and mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/325830 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [19:49:12] (03PS7) 10Andrew Bogott: Add clientlib.pp and mwopenstackclients.py [puppet] - 
10https://gerrit.wikimedia.org/r/325830 (https://phabricator.wikimedia.org/T150092) [19:50:32] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:325962|Revert "Add Abenaki language (abe) to Wikidata"]] T150633 (duration: 00m 46s) [19:50:38] ^ hoo Dereckson live everywhere [19:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:43] T150633: Please enable abe or Western Abenaki - https://phabricator.wikimedia.org/T150633 [19:51:30] MatmaRex: your change is live on mwdebug1002 [19:53:48] thcipriani: thanks. seems to work! [19:53:55] thcipriani: Thanks again [19:53:59] MatmaRex: awesome! Going live. [19:54:08] addshore: your change is on mwdebug1002, check please [19:54:14] ack, checking [19:54:41] thcipriani: looks good to roll out [19:55:18] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2866010 (10Papaul) @akosiaris i just removed the PSU's for a couple of minutes and plugged them back in. The server is back up but i am workin... [19:55:51] addshore: cool, will go live after next sync [19:56:25] !log thcipriani@tin Synchronized php-1.29.0-wmf.5/includes/page/ImageHistoryPseudoPager.php: SWAT: [[gerrit:326497|ImageHistoryPseudoPager: Do not ignore limit from URL]] T152813 (duration: 00m 45s) [19:56:26] ^ MatmaRex live everywhere [19:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:38] T152813: File histories won't display more than 10 files on a page - https://phabricator.wikimedia.org/T152813 [19:56:54] thanks thcipriani! [19:57:54] MatmaRex: yw! 
:) [19:58:01] !log gehel@tin Starting deploy [wdqs/wdqs@cb82bdd]: (no message) [19:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:27] !log thcipriani@tin Synchronized php-1.29.0-wmf.5/extensions/ElectronPdfService/specials/SpecialElectronPdf.php: SWAT: [[gerrit:326437|Include namespace when setting hidden form field]] (duration: 00m 44s) [19:58:31] (03PS1) 10Mobrovac: Trending Edits: Add to SCB [puppet] - 10https://gerrit.wikimedia.org/r/326527 (https://phabricator.wikimedia.org/T150043) [19:58:32] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:58:33] ^ addshore live everywhere [19:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:42] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:58:46] thcipriani: ack [19:58:46] (03PS1) 10Mobrovac: Trending Edits: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/326528 (https://phabricator.wikimedia.org/T150043) [19:59:10] (03PS1) 10Mobrovac: RESTBase: Add trending edits service config portion [puppet] - 10https://gerrit.wikimedia.org/r/326529 (https://phabricator.wikimedia.org/T150043) [19:59:28] !log gehel@tin Finished deploy [wdqs/wdqs@cb82bdd]: (no message) (duration: 01m 27s) [19:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:25] (03CR) 10Mobrovac: [C: 04-1] "Needs If9bc7ec2fbba271234d91bcaccd933629d3fd60d and https://github.com/wikimedia/restbase/pull/727" [puppet] - 10https://gerrit.wikimedia.org/r/326529 (https://phabricator.wikimedia.org/T150043) (owner: 10Mobrovac) [20:17:45] (03CR) 10Filippo Giunchedi: prometheus: export gdnsd stats via node_exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) (owner: 10Filippo Giunchedi) [20:17:57] (03PS4) 10Filippo 
Giunchedi: prometheus: export gdnsd stats via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) [20:18:45] (03CR) 10Filippo Giunchedi: [C: 032] visualdiff: Install uprightdiff package [puppet] - 10https://gerrit.wikimedia.org/r/326053 (owner: 10Legoktm) [20:18:51] (03PS3) 10Filippo Giunchedi: visualdiff: Install uprightdiff package [puppet] - 10https://gerrit.wikimedia.org/r/326053 (owner: 10Legoktm) [20:21:54] (03PS2) 10Filippo Giunchedi: contint: Add dependencies needed for PoolCounter tests [puppet] - 10https://gerrit.wikimedia.org/r/325145 (https://phabricator.wikimedia.org/T152338) (owner: 10Legoktm) [20:23:11] (03CR) 10Filippo Giunchedi: [C: 032] contint: Add dependencies needed for PoolCounter tests [puppet] - 10https://gerrit.wikimedia.org/r/325145 (https://phabricator.wikimedia.org/T152338) (owner: 10Legoktm) [20:25:57] (03PS2) 10Andrew Bogott: Designate policy: Add public read-only access [puppet] - 10https://gerrit.wikimedia.org/r/325994 (https://phabricator.wikimedia.org/T150092) [20:27:31] (03CR) 10Andrew Bogott: [C: 032] Designate policy: Add public read-only access [puppet] - 10https://gerrit.wikimedia.org/r/325994 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [20:27:42] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:29:47] 06Operations, 10hardware-requests: Mediawiki log host for eqiad to replace fluorine - https://phabricator.wikimedia.org/T153008#2866197 (10fgiunchedi) [20:36:06] !log upgrade grafana to 4.0.2 on krypton - T152473 [20:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:18] T152473: Upgrade labmon1001 Grafana to 4.0.1 - https://phabricator.wikimedia.org/T152473 [20:36:54] 06Operations, 06Performance-Team: Upgrade Grafana to 4.0.2 - https://phabricator.wikimedia.org/T152473#2866232 (10fgiunchedi) [20:39:23] thcipriani, if i need to deploy a 
patch, do i still need to check in the bump to the core? [20:40:14] yurik: yeah, for the time being, gerrit troubles mean that the submodule bump on core is manual for a while :( [20:42:58] thcipriani: I've got a patch incoming to make-wmf-branch that should fix it for upcoming branches [20:44:36] thcipriani: And bam: https://gerrit.wikimedia.org/r/#/c/326543/ [20:45:48] oh! didn't realize that's what was happening :) [20:46:08] That's half of it [20:46:15] It got super strict with url matching [20:46:17] Which is dumb af. [20:46:28] Other half is config-based, but we're already fixing that [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T2100). [21:00:50] i'm about to push a few patches [21:00:58] e.g. kartotherian service [21:00:59] (03PS2) 10Jforrester: Provide the visual editor wikitext mode Beta Feature to all [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321993 [21:03:44] !log yurik@tin Starting deploy [kartotherian/deploy@68761ce]: (no message) [21:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:26] yurik, let me know once you are done and i'll go for parsoid [21:04:47] nod [21:06:22] (03CR) 10Ottomata: [C: 032] Add debian patch to remove install usr/LICENSE [debs/python-confluent-kafka] (debian) - 10https://gerrit.wikimedia.org/r/326456 (https://phabricator.wikimedia.org/T152771) (owner: 10Ottomata) [21:07:32] !log yurik@tin Finished deploy [kartotherian/deploy@68761ce]: (no message) (duration: 03m 48s) [21:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:11] subbu, done for now, will need to do something in a bit, let me know when done. [21:12:02] yurik, ok .. updating beta cluster right now. will be a few mins. 
[21:13:34] !log starting parsoid deploy [21:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:56] !log ssastry@tin Starting deploy [parsoid/deploy@7316a90]: Updating Parsoid config [21:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:24] !log ssastry@tin Finished deploy [parsoid/deploy@7316a90]: Updating Parsoid config (duration: 11m 27s) [21:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:26] !log mholloway-shell@tin Starting deploy [mobileapps/deploy@cc4dcf2]: Update mobileapps to 2a8ad26 [21:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:36] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@cc4dcf2]: Update mobileapps to 2a8ad26 (duration: 01m 11s) [21:32:47] !log ssastry@tin Starting deploy [parsoid/deploy@7316a90]: (no message) [21:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:58] !log ssastry@tin Finished deploy [parsoid/deploy@7316a90]: (no message) (duration: 02m 11s) [21:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:08] subbu, still going? [21:45:24] yurik, sorry .. done. [21:45:30] thx :) [21:47:42] staging JsonConfig patch on mwdebug1002.eqiad.wmnet [21:55:13] !log yurik@tin Synchronized php-1.29.0-wmf.5/extensions/JsonConfig: jsonconfig ext bump https://gerrit.wikimedia.org/r/#/c/326540/ (duration: 01m 08s) [21:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161212T2200). Please do the needful. 
[22:01:26] 07Puppet, 10Continuous-Integration-Config: rake-jessie tests check .pp files but are not triggered by .pp file changes - https://phabricator.wikimedia.org/T153013#2866448 (10Tgr) [22:19:33] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:25:03] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#2866487 (10Papaul) I contact HP, according to them the log file I sent to them is not showing any hardware failure and showing only 1 power sup... [22:26:38] (03PS1) 10Andrew Bogott: Keystone: Allow public listing of roles and role assignments. [puppet] - 10https://gerrit.wikimedia.org/r/326827 (https://phabricator.wikimedia.org/T152708) [22:26:40] (03PS1) 10Andrew Bogott: Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) [22:27:53] (03CR) 10jenkins-bot: [V: 04-1] Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [22:33:55] (03PS2) 10Andrew Bogott: Keystone: Allow public listing of roles and role assignments. [puppet] - 10https://gerrit.wikimedia.org/r/326827 (https://phabricator.wikimedia.org/T152708) [22:33:57] (03PS2) 10Andrew Bogott: Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) [22:35:02] (03CR) 10jenkins-bot: [V: 04-1] Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [22:35:47] (03CR) 10Andrew Bogott: [C: 032] Keystone: Allow public listing of roles and role assignments. 
[puppet] - 10https://gerrit.wikimedia.org/r/326827 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [22:37:58] (03PS3) 10Andrew Bogott: Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) [22:39:05] (03PS4) 10Andrew Bogott: Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) [22:43:53] (03PS5) 10Andrew Bogott: Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) [22:43:55] (03PS1) 10Andrew Bogott: Move observerenv.sh to /usr/local/bin [puppet] - 10https://gerrit.wikimedia.org/r/326833 [22:44:28] (03CR) 10Andrew Bogott: "Note that this still installs bogus passwords on labs... that will be addressed after the monitoring framework is running." [puppet] - 10https://gerrit.wikimedia.org/r/326833 (owner: 10Andrew Bogott) [22:44:40] If there is anyone around with… I guess it is dev access to directly edit the DB, a poke would be welcome. [22:45:07] It’s a single line, so doing the whole phab thing seems silly. [22:45:45] likely restricted/deployment/ops access [22:45:50] what database? [22:46:39] What database and what needs doing? [22:47:24] commonswiki.transcode [22:47:33] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:48:01] (03PS6) 10Andrew Bogott: Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) [22:48:01] Revent: What about it? [22:48:09] https://quarry.wmflabs.org/query/14059 <- the entry for Miglyd.wav is ‘garbage’ [22:48:31] That query is the ‘uninitialized transcodes’ on Commons. [22:48:55] I have successfully crapped 15k entries back through the scalers. [22:49:22] Miglyd.wav is a deleted file, that was a corrupted garbage. 
[22:49:50] We don't typically edit the DB directly. Is there any harm in just letting it work its way out of the processing queue? [22:49:51] transcoding is done by the TimedMediaHandler extension, right? [22:50:01] Other deleted files that had entries (there were about a dozen) undeleting and redeleting made the entries go away. [22:50:08] ostriches: It’s not in the queue. [22:50:27] It can’t be in the queue, because it’s a deleted garbage file. [22:50:40] (it’s not actually a wav file) [22:50:42] Then....what's the problem? [22:51:15] It will remain, forever, as the only ‘uninitialized transcode’ unless someone nukes the line…. not a big deal, but annoying. [22:51:38] Sounds like that needs a bugfix in TMH then, not one-off mucking in the DB. What happens when this occurs again? [22:52:03] TMH should fix its transcode status if it's wrong [22:52:40] Umm… there were about a dozen cases of deleted files with entries, all ‘years’ old, and all fixed by undel and redel except that one.... [22:53:17] (out of something like 700k audio files on Commons) [22:53:35] I’m dubious enough the actual ‘bug’ is worth fixing, lol. [22:54:09] well, if you had to undelete and re-delete them... [22:54:12] Well, if it's happened a dozen or so cases, sounds like it's worht fixing to me! [22:54:12] There's a bigger bug worth fixing [22:54:36] That was still only about a dozen, and they were all from years ago. [22:54:53] So whatever might even already be fixed. [22:54:55] So it may have been fixed already? [22:55:01] Other than it not cleaning up other entries [22:55:01] Why can't we undelete & re-delete this final case then? [22:55:09] (if that's worked before) [22:55:15] ostriches: It’s not a valid audio file. [22:55:58] For whatever reason, it doesn’t work in this particular case, because it’s…. whatever it is, it’s not a audio file. [22:56:08] * ostriches nods [22:56:21] obfuscated porn? 
[22:56:49] (shrugs) The others, when undeleted, were added to the queue, and then removed (correctly) when redeleted. [22:57:39] (btw, the huge overload of the scalers is not my fault, it’s the upload of a huge pile of White House Press Briefings) [22:57:55] Surely you blame Trump for that? [22:58:09] they are in 1280P, and about an hour and a half long…. they will eventually all fail. [22:58:50] Revent, a bunch of those happened recently [22:58:56] A buunch… yes. [22:59:08] as server-side uploads because they were too big for normal upload [22:59:34] They are not transcodable other than maybe as 160P, until the scalers get more balls. [22:59:42] I dropped that transcode entry [22:59:48] ostriches: Thanks. [22:59:49] mysql:wikiadmin@db1040 [commonswiki]> delete from transcode where transcode_id = 412149; [22:59:49] Query OK, 1 row affected (0.00 sec) [23:00:14] Revent, are they causing issues with the videoscaling machines? [23:00:46] There is a phab ticket about it, it’s simply that the scalers time out before they run. [23:01:24] I’ve tested on some of the old ones, a 1280P video, at an hour and a half long, times out even if it’s the ONLY task running. [23:01:51] The scalers simply need more balls. [23:02:37] Revent: please use appropriate language. "balls" is not something servers have nor need. [23:02:44] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2866564 (10hashar) [23:02:46] Fixing whatever ‘task management’, or whatever you want to call it, that will try to run 100+ transcodes at the same time would also help. [23:03:01] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2866565 (10fgiunchedi) >>! In T149451#2864916, @Ottomata wrote: > Hm, actually, it might even be nicer to feed 5xx logs back into a dedicated topic in kafka. If we did that, then we could use logs... 
[23:03:12] greg-g: Sorry, slang. More ‘horsepower’ [23:03:30] Anyone know how to create a cron job in production? I assume this is managed by puppet, but I couldn't find any useful documentation. [23:03:34] I don't really know anything about the transcoding process. [23:03:43] 06Operations, 10scap, 06Release-Engineering-Team (Long-Lived-Branches): Make git 2.2.0+ (preferably 2.8.x) available - https://phabricator.wikimedia.org/T140927#2481328 (10hashar) Mentions git-submodule got ported to C with 2.9.0. `git submodule update` also learns `--jobs` to fetch changes in parallel (de... [23:03:50] That kind of suggestion needs to go via people who do [23:03:51] Revent: noted :) [23:04:00] kaldari, yeah, use the puppet cron resource [23:04:04] Krenair: for MW purposes? [23:04:07] ffs [23:04:09] There’s a ticket... [23:04:34] Revent, what? [23:04:35] kaldari: If it's a MW maintenance script you're looking to cron-ify, have a look at maintenance.pp in puppet [23:04:47] Reedy* [23:04:47] Either way, what Krenair said, cron{} is what you want [23:05:19] Krenair: I’m trying to find it, there is a ticket open about upgrading the videoscaler hardware to be able to handle larger uploads. [23:05:33] kaldari: mediawiki/maintenance.pp to be specific, not tendril [23:05:39] ostriches: or use my method keep trying til it stops spamming error emails :P [23:06:07] https://phabricator.wikimedia.org/T114337 <- this is old, I did not see it before. [23:06:18] https://phabricator.wikimedia.org/T150067 <- this is new [23:06:30] ostriches: thanks! [23:06:35] yw [23:07:49] Krenair: I’ve been working, on and off, of shoving old broken transcodes back through for like a month now. [23:08:33] There are a ton of old ones where just refreshing the page adds 5 or 6 entries to the table. 
[23:09:03] 06Operations, 10Wikimedia-General-or-Unknown, 10hardware-requests: Extend capacity for video scalers - https://phabricator.wikimedia.org/T150067#2773188 (10greg) Cross pollinating: {T114337} [23:09:08] (presumably because the desired targets have changed since the last time the page was edited) [23:09:18] 06Operations, 10TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1691895 (10greg) Cross pollinating: {T150067} [23:09:20] :) [23:09:36] The added ones don’t get queued, they just go into ‘uninialized’ [23:09:45] *unitialized [23:16:40] (03CR) 10Andrew Bogott: [C: 032] Move observerenv.sh to /usr/local/bin [puppet] - 10https://gerrit.wikimedia.org/r/326833 (owner: 10Andrew Bogott) [23:16:54] (03CR) 10Andrew Bogott: [C: 032] Keystone: Monitor project membership [puppet] - 10https://gerrit.wikimedia.org/r/326828 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [23:18:38] ostriches: so I found tendril/maintenance.pp, but I can't seem to find mediawiki/maintenance.pp [23:19:18] ./modules/role/manifests/mediawiki/maintenance.pp [23:19:57] thanks, not sure why that didn't show up in my search :) [23:22:05] (03PS1) 10Odder: Add a logo for beta Meta-Wiki on Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326842 (https://phabricator.wikimedia.org/T125942) [23:22:35] (03CR) 10Odder: "I ran optipng -o7 on all three files before uploading them." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/326842 (https://phabricator.wikimedia.org/T125942) (owner: 10Odder) [23:22:39] Anyhow, after shoving probably 20k transcodes back through (a mindless thing while doing other stuff) I see two problems… 1) people are uploading stuff like a trip all the way around the Moscow ring railroad, at 1920P, and the scalers are too wimpy for that, and 2) there seems to be a… either a bug, or a design flaw, in how tasks are started, in that it will keep starting tasks until the server (per ganglia) runs out of memory, and that’s [23:22:40] far far more than the number of CPUs. [23:22:44] (03CR) 10jenkins-bot: [V: 04-1] Add a logo for beta Meta-Wiki on Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326842 (https://phabricator.wikimedia.org/T125942) (owner: 10Odder) [23:23:03] Trying to multitask video scaling is not useful, it just makes them more likely to time out. [23:23:42] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:24:10] It’s not a ‘parallel’ kind of task. [23:25:43] mutante: When running /etc/init.d/icinga reload in einsteinium, is there something I can do to get verbose output? [23:25:48] It's failing but not telling me why [23:28:32] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [23:28:56] (03PS1) 10Andrew Bogott: Include icinga dependencies for keystone checks. [puppet] - 10https://gerrit.wikimedia.org/r/326844 (https://phabricator.wikimedia.org/T152708) [23:29:18] (03CR) 10Odder: "SVG version is available at https://commons.wikimedia.org/wiki/File:Wikimedia-logo-meta-beta.svg" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326842 (https://phabricator.wikimedia.org/T125942) (owner: 10Odder) [23:31:11] (03CR) 10Andrew Bogott: [C: 032] Include icinga dependencies for keystone checks.
[puppet] - 10https://gerrit.wikimedia.org/r/326844 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [23:31:52] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [23:38:14] _joe_ akosiaris if you get a chance tomorrow can you take a look at https://gerrit.wikimedia.org/r/#/c/323079 ? [23:39:08] also https://gerrit.wikimedia.org/r/#/c/325466 for redis_exporter but less urgent, I'm not fully sure about the best way to gather the various redis instances to add to the prometheus configuration [23:40:32] (03PS1) 10Andrew Bogott: Added check_keystone_roles command cfg for icinga [puppet] - 10https://gerrit.wikimedia.org/r/326845 (https://phabricator.wikimedia.org/T152708) [23:41:34] (03CR) 10Andrew Bogott: [C: 032] Added check_keystone_roles command cfg for icinga [puppet] - 10https://gerrit.wikimedia.org/r/326845 (https://phabricator.wikimedia.org/T152708) (owner: 10Andrew Bogott) [23:48:04] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:48:22] (03PS1) 10Filippo Giunchedi: install-server: add partman for restbase101[678] [puppet] - 10https://gerrit.wikimedia.org/r/326847 (https://phabricator.wikimedia.org/T150964) [23:48:34] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [23:49:46] (03CR) 10Filippo Giunchedi: [C: 032] install-server: add partman for restbase101[678] [puppet] - 10https://gerrit.wikimedia.org/r/326847 (https://phabricator.wikimedia.org/T150964) (owner: 10Filippo Giunchedi) [23:51:08] 06Operations, 10hardware-requests: eqiad: (1) Mediawiki log host to replace fluorine - https://phabricator.wikimedia.org/T153008#2866762 (10RobH) [23:51:22] 06Operations, 10hardware-requests: eqiad: (1) Mediawiki log host to replace fluorine - https://phabricator.wikimedia.org/T153008#2866197 (10RobH) a:05RobH>03mark So for spare systems in eqiad, I have the following system already purchased in our spare pool: WMF4724, under warranty until 2018-12-05. Has d... [23:52:44] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [23:55:04] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 2 minutes ago with 18 failures. Failed resources (up to 3 shown): Exec[ip addr add 2620:0:861:103:10:64:32:28/64 dev eth0],Service[ferm],Service[diamond],Service[prometheus-node-exporter] [23:56:27] (03PS1) 10Andrew Bogott: Keystone: fix c/p error in role monitoring [puppet] - 10https://gerrit.wikimedia.org/r/326852 [23:56:43] dcausse: ebernhardson: could you check logstash with *ecwikimedia* as search? It seems there are issues with Cirrus [23:57:45] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:58:04] (03CR) 10Andrew Bogott: [C: 032] Keystone: fix c/p error in role monitoring [puppet] - 10https://gerrit.wikimedia.org/r/326852 (owner: 10Andrew Bogott)