[00:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T0000).
[00:00:06] subbu and Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:23] o/
[00:00:23] submits the terbium change and runs puppet
[00:00:37] Thanks!
[00:00:47] * subbu is around
[00:01:41] TimStarling, a +1 on https://gerrit.wikimedia.org/r/#/c/395139/1 might be helpful to whoever is swatting.
[00:02:10] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:11] (CR) RobH: [C: 1] logstash: decommission logstash100[1-3] [puppet] - https://gerrit.wikimedia.org/r/395565 (https://phabricator.wikimedia.org/T175830) (owner: Gehel)
[00:02:18] (CR) Tim Starling: [C: 1] Enable RemexHTML on wikis with zero high priority linter errors [mediawiki-config] - https://gerrit.wikimedia.org/r/395139 (https://phabricator.wikimedia.org/T182042) (owner: Subramanya Sastry)
[00:03:05] any swatters around to deploy that?
[00:03:28] (CR) Dzahn: "es.php --wiki testwikidatawiki --max-time 900 --batch-size 200 --dispatch-interval 30 >/dev/null 2>&1' to '/usr/local/bin/mwscript extensi" [puppet] - https://gerrit.wikimedia.org/r/395672 (https://phabricator.wikimedia.org/T182159) (owner: Ladsgroup)
[00:03:55] (CR) Dzahn: "well, that paste wasn't great, but ..
it worked and updated the command, no need to absent the old cron" [puppet] - https://gerrit.wikimedia.org/r/395672 (https://phabricator.wikimedia.org/T182159) (owner: Ladsgroup)
[00:05:35] mutante: Oops, I forgot that
[00:05:37] facepalm
[00:05:39] sorry
[00:06:09] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 4.891 second response time
[00:06:58] I can SWAT
[00:07:20] Amir1: it doesn't apply if the resource name doesn't change and only the command changes, i just wanted to make sure we don't leave remnants
[00:07:25] (PS2) Thcipriani: Enable RemexHTML on wikis with zero high priority linter errors [mediawiki-config] - https://gerrit.wikimedia.org/r/395139 (https://phabricator.wikimedia.org/T182042) (owner: Subramanya Sastry)
[00:07:28] or cronspam
[00:07:39] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/395139 (https://phabricator.wikimedia.org/T182042) (owner: Subramanya Sastry)
[00:08:15] cool
[00:09:09] (Merged) jenkins-bot: Enable RemexHTML on wikis with zero high priority linter errors [mediawiki-config] - https://gerrit.wikimedia.org/r/395139 (https://phabricator.wikimedia.org/T182042) (owner: Subramanya Sastry)
[00:09:20] (CR) jenkins-bot: Enable RemexHTML on wikis with zero high priority linter errors [mediawiki-config] - https://gerrit.wikimedia.org/r/395139 (https://phabricator.wikimedia.org/T182042) (owner: Subramanya Sastry)
[00:10:02] subbu: your change is live on mwdebug1002, check please
[00:10:16] will do
[00:10:57] (PS3) Thcipriani: Enable description usage tracking for all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/395584 (https://phabricator.wikimedia.org/T106287) (owner: Ladsgroup)
[00:11:18] hmm .. https://jam.wikipedia.org/wiki/Gobna-Jinaral_a_Jumieka?action=purge is excepting ...
[00:11:56] i imagine it is from mwdebug1002 since it is from the firefox extension.
[00:12:31] subbu: Invalid tidy driver: "RemexHTML"
[00:12:42] is what I see in logs
[00:13:08] TimStarling, know why that might be?
[00:13:15] * subbu looks at the patch
[00:13:28] * subbu tries a different wiki
[00:13:42] stacktrace: https://phabricator.wikimedia.org/P6432
[00:14:09] caps
[00:14:25] ah!!
[00:14:27] 'mediawikiwiki' => [ 'driver' => 'RemexHtml' ],
[00:14:28] ok.
[00:14:32] 'alswikiquote' => [ 'driver' => 'RemexHTML' ],
[00:14:51] thcipriani, can you revert? i'll push a new patch.
[00:15:03] yep.
[00:15:03] sorry I missed this when I reviewed it
[00:16:53] (PS1) Subramanya Sastry: s/RemexHTML/RemexHtml [mediawiki-config] - https://gerrit.wikimedia.org/r/395675
[00:16:58] (PS2) Dzahn: snapshot,prometheus,maintenance,otrs,archiva: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/395167 (https://phabricator.wikimedia.org/T177225)
[00:17:27] there you go ^^
[00:18:09] * subbu is wondering if he will get the sticker
[00:18:44] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/395675 (owner: Subramanya Sastry)
[00:19:25] I knew it: all clever ruse to get a sticker :)
[00:19:38] * subbu is excited about it!
[00:20:08] (Merged) jenkins-bot: s/RemexHTML/RemexHtml [mediawiki-config] - https://gerrit.wikimedia.org/r/395675 (owner: Subramanya Sastry)
[00:20:30] (CR) jenkins-bot: s/RemexHTML/RemexHtml [mediawiki-config] - https://gerrit.wikimedia.org/r/395675 (owner: Subramanya Sastry)
[00:20:42] subbu: new patch is on mwdebug, check please
[00:20:48] will do
[00:21:10] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:21:58] good on jamwiki .. let me try another wiki
[00:23:22] nlwiktionary also good
[00:23:37] thcipriani, see any other entries in the log? otherwise, good to go.
[00:24:02] subbu: nothing from mwdebug1002, will go live
[00:24:07] ty \o/
[00:24:15] thcipriani: mine is not testable
[00:24:23] (PS3) Cmjohnson: admin: allow Paul Norman (pnorman) to deploy kartotherian / tilerator [puppet] - https://gerrit.wikimedia.org/r/395481 (https://phabricator.wikimedia.org/T182066) (owner: Gehel)
[00:24:57] (PS4) Thcipriani: Enable description usage tracking for all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/395584 (https://phabricator.wikimedia.org/T106287) (owner: Ladsgroup)
[00:25:11] (CR) Dzahn: [C: 2] "this might cause some puppet fail on trusty but it's partially to confirm that and the fix isn't hard" [puppet] - https://gerrit.wikimedia.org/r/395167 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[00:25:19] (PS3) Dzahn: snapshot,prometheus,maintenance,otrs,archiva: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/395167 (https://phabricator.wikimedia.org/T177225)
[00:25:37] Amir1: okie doke, will get it live when I merge it and ping you when done
[00:25:57] thcipriani, is it live everywhere now?
[00:26:05] subbu: not just yet :)
[00:26:09] ok. :)
[00:27:17] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:395139|Enable RemexHTML on wikis with zero high priority linter errors]] T182042 (duration: 00m 51s)
[00:27:23] ^ subbu now it's live everywhere
[00:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:27] T182042: Enable RemexHTML on 170 additional wikis with zero high-priority linter errors (as of Mon, Dec 4, 2017) - https://phabricator.wikimedia.org/T182042
[00:27:45] ty ... TimStarling 178 of 881 wikis done :)
[00:28:27] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/395584 (https://phabricator.wikimedia.org/T106287) (owner: Ladsgroup)
[00:32:09] (CR) Dzahn: "fun, it also just worked fine on trusty hosts.
the issue we saw earlier with mysql hosts seems unique to just these so far" [puppet] - https://gerrit.wikimedia.org/r/395167 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[00:32:35] (Merged) jenkins-bot: Enable description usage tracking for all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/395584 (https://phabricator.wikimedia.org/T106287) (owner: Ladsgroup)
[00:32:49] (CR) jenkins-bot: Enable description usage tracking for all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/395584 (https://phabricator.wikimedia.org/T106287) (owner: Ladsgroup)
[00:36:24] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:395584|Enable description usage tracking for all wikis except commons]] T106287 (duration: 00m 48s)
[00:36:29] ^ Amir1 sync'd
[00:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:32] T106287: [Tracking] Track descriptions usages separately (Create a new description usage aspect "D") - https://phabricator.wikimedia.org/T106287
[00:36:41] Thanks!
[00:36:43] Monitoring
[00:40:19] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.513 second response time
[00:42:23] (PS1) Dzahn: etherpad,gerrit,lists,openldap,ores,url_downloader,yubiauth: rm ganglia [puppet] - https://gerrit.wikimedia.org/r/395681 (https://phabricator.wikimedia.org/T177225)
[00:42:39] I'm leaving for the day
[00:42:43] see you later
[00:42:44] o/
[00:44:03] bye Amir
[00:44:21] (CR) Dzahn: [C: 2] etherpad,gerrit,lists,openldap,ores,url_downloader,yubiauth: rm ganglia [puppet] - https://gerrit.wikimedia.org/r/395681 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[00:44:35] (PS2) Dzahn: etherpad,gerrit,lists,openldap,ores,url_downloader,yubiauth: rm ganglia [puppet] - https://gerrit.wikimedia.org/r/395681 (https://phabricator.wikimedia.org/T177225)
[00:45:59] PROBLEM - Long running screen/tmux on restbase2002 is CRITICAL: CRIT: Long running SCREEN process. (PID: 15442, 1741969s 1728000s).
[00:54:19] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:55:10] (PS1) Dzahn: deployment_server,debug_proxy,dumps,notebook,pmacct: rm ganglia [puppet] - https://gerrit.wikimedia.org/r/395683 (https://phabricator.wikimedia.org/T177225)
[00:56:09] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 3.878 second response time
[00:56:26] (PS2) Dzahn: deployment_server,debug_proxy,dumps,notebook,pmacct: rm ganglia [puppet] - https://gerrit.wikimedia.org/r/395683 (https://phabricator.wikimedia.org/T177225)
[00:57:35] (CR) Dzahn: [C: 2] deployment_server,debug_proxy,dumps,notebook,pmacct: rm ganglia [puppet] - https://gerrit.wikimedia.org/r/395683 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[00:59:19] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:20:19] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 6.496 second response
time
[01:23:19] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:38:10] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 0.477 second response time
[01:45:29] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:48:20] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 6.494 second response time
[01:53:29] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:01:20] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 7.104 second response time
[02:03:33] (PS1) Gergő Tisza: Deploy ReadingLists to production [mediawiki-config] - https://gerrit.wikimedia.org/r/395687 (https://phabricator.wikimedia.org/T181107)
[02:04:29] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:04:51] (CR) jerkins-bot: [V: -1] Deploy ReadingLists to production [mediawiki-config] - https://gerrit.wikimedia.org/r/395687 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[02:11:23] (PS2) Gergő Tisza: Deploy ReadingLists to production [mediawiki-config] - https://gerrit.wikimedia.org/r/395687 (https://phabricator.wikimedia.org/T181107)
[02:13:19] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[02:34:29] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:39:10] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.10) (duration: 08m 53s)
[02:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:19] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 2.492 second response time
[02:52:29] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:58:00] (PS3) Gergő Tisza: Deploy ReadingLists to testwiki [mediawiki-config] -
https://gerrit.wikimedia.org/r/395687 (https://phabricator.wikimedia.org/T181107)
[02:58:02] (PS1) Gergő Tisza: Enable ReadingLists on all SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/395688 (https://phabricator.wikimedia.org/T181107)
[03:10:19] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[03:24:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 869.70 seconds
[03:43:46] Operations, ops-eqiad, DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625#3815079 (RobH)
[03:43:48] Operations, DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3815080 (RobH)
[03:56:10] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 238.77 seconds
[03:56:28] Operations, Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3815087 (StevenJ81) BTW, @MarcoAurelio, I didn't mean to come off as combative (and don't think my comment actually was, to a na...
[04:02:51] Operations, ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3815088 (RobH) Unfortunately, this scs console seems to be doa. Both Arzhel and I have attempted to use it, with no success. The console server powers on, shows power LEDs. However, the network por...
[04:04:12] Can someone confirm or deny if this account is actually wmf?
https://en.wikipedia.org/wiki/User:Sec%22%2727%5Cu00226%5Cx3ETest_(WMF)
[04:23:45] Operations, ops-eqsin: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3815103 (RobH)
[04:28:03] (CR) EBernhardson: [C: 1] Revert "Revert "Deploy MjoLniR with new deploy repository"" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/394002 (owner: EBernhardson)
[04:29:46] Operations, ops-eqsin: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3815106 (RobH)
[04:31:44] Operations, ops-eqsin: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3815107 (RobH)
[04:38:05] Operations, ops-eqsin: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3815124 (RobH)
[04:40:50] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Transport endpoint is not connected
[04:42:03] Operations, ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3815125 (RobH)
[04:45:59] RECOVERY - Long running screen/tmux on restbase2002 is OK: OK: No SCREEN or tmux processes detected.
[04:56:07] Operations, ops-eqsin: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3815126 (RobH)
[04:57:58] (Abandoned) EBernhardson: Don't retry InitImageDataJob's [puppet] - https://gerrit.wikimedia.org/r/326151 (owner: EBernhardson)
[05:30:39] PROBLEM - Check Varnish expiry mailbox lag on cp4024 is CRITICAL: CRITICAL: expiry mailbox lag is 2045473
[06:17:47] (PS6) BBlack: eqsin: basics [puppet] - https://gerrit.wikimedia.org/r/389741 (https://phabricator.wikimedia.org/T156027)
[06:30:29] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/var/lib/hphpd/hphpd.ini]
[06:31:20] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:32:20] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational
[06:35:46] (CR) BBlack: [C: 2] eqsin: basics [puppet] - https://gerrit.wikimedia.org/r/389741 (https://phabricator.wikimedia.org/T156027) (owner: BBlack)
[06:39:51] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[06:40:10] NTP is probably me, I'll see
[06:40:49] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 2.3e-05 secs
[06:41:10] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[06:41:48] yeah, the NTP alerts are from my change, but there's not a ton that needs doing about them
[06:41:59] all the ntp server configs changed, they're all going to restart and lose sync temporarily, etc
[06:41:59] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[06:42:10] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.000243 secs
[06:42:46] it does seem crazy it hit 3/6 of the existing ones within a ~3 minute window out of 30
[06:42:57] $fqdn_rand woes :P
[06:42:59] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 8.7e-05 secs
[06:45:05] <_joe_> bblack: indeed...
[06:45:59] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[06:46:59] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 7.8e-05 secs
[06:47:21] so we should get a couple more of those for chromium and maerlant, hopefully that's all the spam fallout
[06:47:26] better than I was expecting!
:)
[06:54:29] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[06:54:40] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[06:55:40] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset 0.000766 secs
[06:56:29] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset -2e-05 secs
[07:00:27] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:01:00] (PS1) Marostegui: s7.hosts: Add db1098:3317 [software] - https://gerrit.wikimedia.org/r/395693 (https://phabricator.wikimedia.org/T178359)
[07:03:55] (CR) Marostegui: [C: 2] s7.hosts: Add db1098:3317 [software] - https://gerrit.wikimedia.org/r/395693 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:04:37] (Merged) jenkins-bot: s7.hosts: Add db1098:3317 [software] - https://gerrit.wikimedia.org/r/395693 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:05:48] (PS1) Gergő Tisza: Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107)
[07:06:22] (CR) jerkins-bot: [V: -1] Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[07:11:06] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:12:17] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:12:26] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:13:07] <_joe_> uh
[07:13:26] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py]
[07:13:33] <_joe_> checking ^^
[07:13:56] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:15:02] (PS2) Gergő Tisza: Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107)
[07:16:24] (CR) jerkins-bot: [V: -1] Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[07:18:02] <_joe_> but /win 41
[07:19:53] <_joe_> it seems it was a transient group of failures.
[07:25:44] (PS1) Giuseppe Lavagetto: wmflib: use string for parameter of package, not symbol [puppet] - https://gerrit.wikimedia.org/r/395695
[07:26:57] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 294 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:31:57] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 294 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:33:56] !log cp4024 - backend restart, mailbox lag
[07:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:26] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[07:37:26] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[07:38:26] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:38:47] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:40:37] RECOVERY - Check Varnish expiry mailbox lag on cp4024 is OK: OK: expiry mailbox lag is 0
[07:41:06] RECOVERY -
puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:54:33] (PS3) Gergő Tisza: Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107)
[07:55:10] (CR) jerkins-bot: [V: -1] Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[08:04:55] (PS2) Muehlenhoff: Switch remaining eqiad video scalers to stretch [puppet] - https://gerrit.wikimedia.org/r/395494
[08:05:57] !log reimaging mw2118 to stretch (now for real, yesterday's reimage logged to SAL was interrupted)
[08:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:39] (PS4) Gergő Tisza: Add cron job for purging ReadingLists data [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107)
[08:22:29] (CR) Giuseppe Lavagetto: [C: 1] Switch remaining eqiad video scalers to stretch [puppet] - https://gerrit.wikimedia.org/r/395494 (owner: Muehlenhoff)
[08:24:45] (CR) Addshore: Make wikidata cronjobs use the Wikibase extension and not the build (1 comment) [puppet] - https://gerrit.wikimedia.org/r/395672 (https://phabricator.wikimedia.org/T182159) (owner: Ladsgroup)
[08:32:12] (PS1) Addshore: Make wikidatawiki cronjobs use the Wikibase extension and not the build [puppet] - https://gerrit.wikimedia.org/r/395699 (https://phabricator.wikimedia.org/T182159)
[08:33:15] (PS1) Addshore: Update location of wikibase-rebuildTermSqlIndex script [puppet] - https://gerrit.wikimedia.org/r/395700
[08:36:21] Operations, Wikidata, User-Addshore: operations-puppet still uses the wikidata build in places - https://phabricator.wikimedia.org/T182176#3815207 (Addshore) p:Triage>Unbreak!
[08:36:55] Operations, Wikidata, User-Addshore: operations-puppet still uses the wikidata build in places - https://phabricator.wikimedia.org/T182176#3815220 (Addshore)
[08:37:35] Operations, Wikidata, User-Addshore: operations-puppet still uses the wikidata build in places - https://phabricator.wikimedia.org/T182176#3815207 (Addshore)
[08:40:16] Operations, Wikidata, User-Addshore: operations-puppet still uses the wikidata build in places - https://phabricator.wikimedia.org/T182176#3815233 (Addshore)
[08:43:50] Operations, Puppet, Wikidata, User-Addshore: dumpwikidatajson.sh and dumpwikidatardf.sh still use the wikidata build in puppet - https://phabricator.wikimedia.org/T182175#3815243 (Addshore)
[08:43:59] Operations, Puppet, Wikidata, User-Addshore: operations-puppet still uses the wikidata build in places - https://phabricator.wikimedia.org/T182176#3815244 (Addshore)
[08:44:15] Operations, Puppet, Wikidata, Patch-For-Review, User-Addshore: dispatchChanges.php still being run from the Wikidata build - https://phabricator.wikimedia.org/T182159#3815245 (Addshore)
[08:44:23] (PS1) Addshore: Update dumpwikidatajson & dumpwikidatardf to not use Wikidata build [puppet] - https://gerrit.wikimedia.org/r/395701 (https://phabricator.wikimedia.org/T182175)
[08:53:52] Operations, ops-eqsin: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3815261 (RobH)
[08:58:17] (PS5) Elukey: role::analytics_cluster::hive/oozie: move to profiles [puppet] - https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790)
[08:59:21] (PS2) Hashar: Move contint::package_builder to a profile [puppet] - https://gerrit.wikimedia.org/r/392893
[09:00:08] (PS3) Hashar: Move contint::package_builder to a profile [puppet] - https://gerrit.wikimedia.org/r/392893
[09:04:18] PROBLEM - DPKG on mw2118 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:04:47] PROBLEM -
HHVM jobrunner on mw2118 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time
[09:05:18] RECOVERY - DPKG on mw2118 is OK: All packages OK
[09:07:18] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 1 minute ago with 4 failures. Failed resources (up to 3 shown): File[/etc/firejail/mediawiki-converters.profile],Package[smartmontools],File[/etc/systemd/system/prometheus-node-exporter.service.d/puppet-override.conf],Package[lilypond]
[09:12:47] RECOVERY - HHVM jobrunner on mw2118 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.152 second response time
[09:13:48] (PS3) Hashar: Move contint::hhvm to a profile [puppet] - https://gerrit.wikimedia.org/r/392925
[09:15:39] (PS2) Hashar: contint: libcurl4-gnutls-dev is now absent [puppet] - https://gerrit.wikimedia.org/r/392928
[09:16:38] (PS2) Hashar: Install hhvm dev packages from the profile [puppet] - https://gerrit.wikimedia.org/r/392929
[09:17:10] (CR) Hashar: "That is one less cruft class in the contint module :]" [puppet] - https://gerrit.wikimedia.org/r/392929 (owner: Hashar)
[09:17:17] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:19:55] (PS2) Hashar: Move contint::browsertests to a profile [puppet] - https://gerrit.wikimedia.org/r/392956
[09:19:57] (PS2) Hashar: Move contint::browsers to a profile [puppet] - https://gerrit.wikimedia.org/r/392976
[09:34:41] !log shutting down logstash / elasticsearch on logstash100[123] in preparation for decommission -T175830
[09:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:54] T175830: decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830
[09:37:13] (CR) Elukey: "Some suggestions, going to discuss them on IRC with Joseph" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/395504
(https://phabricator.wikimedia.org/T176983) (owner: Joal)
[09:38:39] (PS2) Gehel: logstash: decommission logstash100[1-3] [puppet] - https://gerrit.wikimedia.org/r/395565 (https://phabricator.wikimedia.org/T175830)
[09:42:19] (CR) Gehel: [C: 2] logstash: decommission logstash100[1-3] [puppet] - https://gerrit.wikimedia.org/r/395565 (https://phabricator.wikimedia.org/T175830) (owner: Gehel)
[09:48:19] (PS19) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[09:48:32] (PS7) TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045)
[09:48:39] (PS15) TerraCodes: Add loginwiki and wikidata to $wgLocalVirtualHosts [mediawiki-config] - https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302)
[09:55:29] Operations, Wikimedia-Logstash, hardware-requests, Discovery-Search (Current work), Patch-For-Review: decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830#3815354 (Gehel)
[09:57:03] Operations, Wikimedia-Logstash, hardware-requests, Discovery-Search (Current work), Patch-For-Review: decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830#3604520 (Gehel) a:Gehel>RobH My steps for decommissioning are done (see checklist in the task description). A...
[10:01:41] (CR) Elukey: [C: 2] role::analytics_cluster::hive/oozie: move to profiles [puppet] - https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[10:01:48] (PS6) Elukey: role::analytics_cluster::hive/oozie: move to profiles [puppet] - https://gerrit.wikimedia.org/r/395527 (https://phabricator.wikimedia.org/T167790)
[10:05:56] (PS1) Volans: wmf-auto-reimage: fix bug in repool message [puppet] - https://gerrit.wikimedia.org/r/395709
[10:05:58] (PS3) MarcoAurelio: Add Portal namespace for mwl.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052)
[10:06:05] (PS3) MarcoAurelio: Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713)
[10:06:35] (CR) Muehlenhoff: [C: 1] wmf-auto-reimage: fix bug in repool message [puppet] - https://gerrit.wikimedia.org/r/395709 (owner: Volans)
[10:07:48] (CR) Muehlenhoff: [C: 1] "Ack, that should fix it." [puppet] - https://gerrit.wikimedia.org/r/395663 (https://phabricator.wikimedia.org/T181796) (owner: Volans)
[10:09:30] Operations, Performance-Team, Traffic, Patch-For-Review: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3815381 (Gilles)
[10:12:19] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:13:18] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:13:40] (CR) Muehlenhoff: [C: 1] "One nit, looks good to me."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/395662 (https://phabricator.wikimedia.org/T181798) (owner: Volans)
[10:18:51] (PS2) Volans: wmf-auto-reimage: add --conftool-value option [puppet] - https://gerrit.wikimedia.org/r/395662 (https://phabricator.wikimedia.org/T181798)
[10:18:53] (PS2) Volans: wmf-auto-reimage: improve screen/tmux detection [puppet] - https://gerrit.wikimedia.org/r/395663 (https://phabricator.wikimedia.org/T181796)
[10:18:55] (PS2) Volans: wmf-auto-reimage: fix bug in repool message [puppet] - https://gerrit.wikimedia.org/r/395709
[10:19:04] (CR) Volans: wmf-auto-reimage: add --conftool-value option (1 comment) [puppet] - https://gerrit.wikimedia.org/r/395662 (https://phabricator.wikimedia.org/T181798) (owner: Volans)
[10:23:11] (PS1) Elukey: role::analytics_cluster::[hadoop::standby|hue]: fix hiera config [puppet] - https://gerrit.wikimedia.org/r/395710 (https://phabricator.wikimedia.org/T167790)
[10:29:52] (CR) Elukey: [C: 2] role::analytics_cluster::[hadoop::standby|hue]: fix hiera config [puppet] - https://gerrit.wikimedia.org/r/395710 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[10:31:33] (CR) Muehlenhoff: [C: 1] wmf-auto-reimage: add --conftool-value option [puppet] - https://gerrit.wikimedia.org/r/395662 (https://phabricator.wikimedia.org/T181798) (owner: Volans)
[10:32:47] (PS3) Volans: wmf-auto-reimage: add --conftool-value option [puppet] - https://gerrit.wikimedia.org/r/395662 (https://phabricator.wikimedia.org/T181798)
[10:33:18] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:34:30] (CR) Volans: [C: 2] wmf-auto-reimage: add --conftool-value option [puppet] - https://gerrit.wikimedia.org/r/395662 (https://phabricator.wikimedia.org/T181798) (owner: Volans)
[10:34:47] (CR) Volans: [C: 2] wmf-auto-reimage: improve screen/tmux detection [puppet] -
10https://gerrit.wikimedia.org/r/395663 (https://phabricator.wikimedia.org/T181796) (owner: 10Volans) [10:34:59] (03PS3) 10Volans: wmf-auto-reimage: improve screen/tmux detection [puppet] - 10https://gerrit.wikimedia.org/r/395663 (https://phabricator.wikimedia.org/T181796) [10:36:01] (03PS3) 10Volans: wmf-auto-reimage: fix bug in repool message [puppet] - 10https://gerrit.wikimedia.org/r/395709 [10:36:40] 10Operations, 10Community-Tech, 10DBA, 10MediaWiki-General-or-Unknown, and 3 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3815431 (10EddieGP) JFTR, I answered marostegui on irc yesterday: That patch isn't merged yet, it especially needs rev... [10:36:55] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix bug in repool message [puppet] - 10https://gerrit.wikimedia.org/r/395709 (owner: 10Volans) [10:37:19] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:39:41] !log mobrovac@tin Started deploy [restbase/deploy@b1d7c82]: Use Cass3 for revisions, deprecate trending-edits, fix CX end point - T179421 T180384 T173801 [10:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:55] T179421: Migrate revisions and restrictions from legacy to new storage - https://phabricator.wikimedia.org/T179421 [10:39:55] T173801: https://wikimedia.org/api/rest_v1/#!/Transform/post_transform_html_from_from_lang_to_to_lang_provider doesn't return any content - https://phabricator.wikimedia.org/T173801 [10:39:55] T180384: Turn off Trending Service - https://phabricator.wikimedia.org/T180384 [10:43:25] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (blocked): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3815448 (10mobrovac) [10:45:43] !log mobrovac@tin Finished deploy [restbase/deploy@b1d7c82]: Use Cass3 for revisions, deprecate trending-edits, fix CX end 
point - T179421 T180384 T173801 (duration: 06m 02s) [10:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:55] T179421: Migrate revisions and restrictions from legacy to new storage - https://phabricator.wikimedia.org/T179421 [10:45:55] T173801: https://wikimedia.org/api/rest_v1/#!/Transform/post_transform_html_from_from_lang_to_to_lang_provider doesn't return any content - https://phabricator.wikimedia.org/T173801 [10:45:56] T180384: Turn off Trending Service - https://phabricator.wikimedia.org/T180384 [10:49:08] (03PS1) 10Muehlenhoff: Fix texlive dependency for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/395712 [10:53:38] (03PS1) 10Addshore: Enable AdvancedSearch on fawiki and huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395713 (https://phabricator.wikimedia.org/T181498) [10:55:06] !log reimaging mw2119 to stretch [10:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:20] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2118.codfw.wmnet [10:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:57] (03PS1) 10Addshore: wmgUseNewWikiDiff2Extension true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395714 (https://phabricator.wikimedia.org/T180476) [10:59:59] (03CR) 10ArielGlenn: "Have you considered looking at the session id instead? This should stay the same for all the descendents of a cron job. 
pkill lets you ki" [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [11:01:01] (03PS2) 10Giuseppe Lavagetto: environments: add environment for removing hiera autolookups [puppet] - 10https://gerrit.wikimedia.org/r/395545 (https://phabricator.wikimedia.org/T181971) [11:01:03] (03PS3) 10Giuseppe Lavagetto: standard: assume standard profile structure [puppet] - 10https://gerrit.wikimedia.org/r/395546 (https://phabricator.wikimedia.org/T181971) [11:01:05] (03PS1) 10Giuseppe Lavagetto: mediawiki: move mediawiki::web to a profile [puppet] - 10https://gerrit.wikimedia.org/r/395715 [11:01:07] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::web: explicitly set log retention days [puppet] - 10https://gerrit.wikimedia.org/r/395716 [11:01:09] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::nutcracker: explicitly set log verbosity [puppet] - 10https://gerrit.wikimedia.org/r/395717 [11:02:33] (03PS3) 10MarcoAurelio: gerritbot: bolden `merged` on Phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) [11:16:11] (03CR) 10Paladox: "Note you also need to do this for the .vm template too." 
[puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) (owner: 10MarcoAurelio) [11:17:45] (03PS1) 10Volans: wmf-auto-reimage: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/395718 (https://phabricator.wikimedia.org/T181798) [11:18:37] (03CR) 10Muehlenhoff: [C: 031] wmf-auto-reimage: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/395718 (https://phabricator.wikimedia.org/T181798) (owner: 10Volans) [11:18:53] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/395718 (https://phabricator.wikimedia.org/T181798) (owner: 10Volans) [11:37:09] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3398639 (10EddieGP) **Afaiui* nobody cares about the database still existing but nothing pointing to it. I read it as all the worr... [11:55:12] 10Operations, 10Ops-Access-Requests: Grant jdrewniak access to stat1006 and its eventlogging DB - https://phabricator.wikimedia.org/T182187#3815686 (10Jdrewniak) [11:57:56] jouncebot next [11:57:56] In 2 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T1400) [12:03:41] PROBLEM - mediawiki-installation DSH group on mw2119 is CRITICAL: Host mw2119 is not in mediawiki-installation dsh group [12:03:42] PROBLEM - Check systemd state on mw2119 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:04:00] 10Operations, 10Ops-Access-Requests: Grant jdrewniak access to stat1006 and its eventlogging DB - https://phabricator.wikimedia.org/T182187#3815719 (10phuedx) AFAICT access to stat100{4,5} should be enough here, so adding @Jdrewniak to the `analytics-privatedata-users` group is all that is required. 
[12:05:31] PROBLEM - nutcracker port on mw2119 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [12:07:06] (03PS3) 10ArielGlenn: move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) [12:07:11] PROBLEM - nutcracker process on mw2119 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [12:07:42] 10Operations, 10Ops-Access-Requests: Grant jdrewniak access to stat1006 and its eventlogging DB - https://phabricator.wikimedia.org/T182187#3815726 (10Jdrewniak) oh look at that, I already have access to stat1004/5, closing the ticket then [12:08:47] 10Operations, 10Ops-Access-Requests: Grant jdrewniak access to stat1006 and its eventlogging DB - https://phabricator.wikimedia.org/T182187#3815731 (10Jdrewniak) 05Open>03Resolved [12:11:21] PROBLEM - HHVM jobrunner on mw2119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time [12:14:52] PROBLEM - DPKG on mw2119 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:15:52] RECOVERY - DPKG on mw2119 is OK: All packages OK [12:22:09] (03CR) 10Alexandros Kosiaris: "I haven't reviewed the 2 packages. Is ruby-mysql2 fully compatible with ruby-mysql ? 
Note that we are still using ruby-mysql in servermon," [puppet] - 10https://gerrit.wikimedia.org/r/391336 (owner: 10Paladox) [12:23:11] RECOVERY - nutcracker process on mw2119 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:23:32] RECOVERY - nutcracker port on mw2119 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:23:51] RECOVERY - Check systemd state on mw2119 is OK: OK - running: The system is fully operational [12:24:22] RECOVERY - HHVM jobrunner on mw2119 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.074 second response time [12:30:38] (03PS3) 10Giuseppe Lavagetto: environments: add environment for removing hiera autolookups [puppet] - 10https://gerrit.wikimedia.org/r/395545 (https://phabricator.wikimedia.org/T181971) [12:30:40] (03PS4) 10Giuseppe Lavagetto: standard: assume standard profile structure [puppet] - 10https://gerrit.wikimedia.org/r/395546 (https://phabricator.wikimedia.org/T181971) [12:30:42] (03PS2) 10Giuseppe Lavagetto: mediawiki: move mediawiki::web to a profile [puppet] - 10https://gerrit.wikimedia.org/r/395715 [12:30:44] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::web: explicitly set log retention days [puppet] - 10https://gerrit.wikimedia.org/r/395716 [12:30:46] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::nutcracker: explicitly set log verbosity [puppet] - 10https://gerrit.wikimedia.org/r/395717 [12:34:07] (03CR) 10Muehlenhoff: Add a Prometheus exporter for PDNS recursor (031 comment) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 (owner: 10Muehlenhoff) [12:34:14] (03CR) 10Alexandros Kosiaris: [C: 031] prometheus: add ores redis job [puppet] - 10https://gerrit.wikimedia.org/r/395569 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [12:34:16] (03PS2) 10Muehlenhoff: Add a Prometheus exporter for PDNS recursor [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 [12:35:50] (03PS4) 10Alexandros 
Kosiaris: Move contint::package_builder to a profile [puppet] - 10https://gerrit.wikimedia.org/r/392893 (owner: 10Hashar) [12:35:55] (03CR) 10Alexandros Kosiaris: [C: 032] Move contint::package_builder to a profile [puppet] - 10https://gerrit.wikimedia.org/r/392893 (owner: 10Hashar) [12:38:30] (03PS1) 10Elukey: Add support for Real Time ingestion metrics. [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/395727 [12:38:32] (03PS1) 10Elukey: Remove unsed parameter --allowed-daemons [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/395728 [12:39:39] Hey, I have some patches that need to go before wikidata is going to wmf.11 (today) [12:39:45] https://gerrit.wikimedia.org/r/#/c/395699/ [12:40:05] https://gerrit.wikimedia.org/r/#/c/395700/1 [12:40:18] https://gerrit.wikimedia.org/r/#/c/395701/1 [12:40:25] \o [12:40:25] The last one is for apergos I think [12:42:59] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3815834 (10MarcoAurelio) @StevenJ81 Sorry then for getting you wrong. @EddieGP Yes, I think we should not try to make exotic solu... [12:43:46] I was just asking addshore about the last changeset there, whether it's been tested etc [12:44:28] 10Operations, 10Puppet, 10GerritBot, 10Patch-For-Review: Bolden the "merged" word when a patch is merged - https://phabricator.wikimedia.org/T181886#3815841 (10MarcoAurelio) It's a patch/change for operations/puppet repo. Adding #operations so they are aware of this. 
[12:47:21] apergos: no, but using Wikidata will defintly break as it is no longer updated [12:47:46] The 2 files are identical on the current branch / last branch [12:51:21] (03PS2) 10Elukey: Remove unsed parameter --allowed-daemons [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/395728 [12:52:56] (03PS3) 10Elukey: Remove unsed parameter --allowed-daemons [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/395728 [12:59:21] (03PS3) 10Zfilipin: WIP Update RuboCop Ruby gem [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) [13:02:34] (03CR) 10Zfilipin: "> Giuseppe Lavagetto" [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) (owner: 10Zfilipin) [13:03:25] (03PS4) 10Zfilipin: Update RuboCop Ruby gem [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) [13:17:08] (03CR) 10Zfilipin: "Sorry, I am updating many repositories, I did not even notice that this one has a fairly recent version of rubocop until hashar mentioned " [puppet] - 10https://gerrit.wikimedia.org/r/395522 (https://phabricator.wikimedia.org/T180878) (owner: 10Zfilipin) [13:40:14] !log reimaging mw1260 to stretch [13:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:33] (03PS1) 10Elukey: profile::hive::server/metastore: add Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/395736 (https://phabricator.wikimedia.org/T177458) [13:49:33] o/ zeljkof [13:50:04] addshore: yo [13:50:26] Do you want to run swat today?? I can do my patches seperate again! 
[13:50:30] #routine [13:50:33] sure [13:50:51] feel free to start first, I can take over when you are don [13:50:52] done [13:51:59] sup addshore [13:53:48] ohia Bsadowski1 [13:56:29] (03PS2) 10Elukey: profile::hive::server/metastore: add Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/395736 (https://phabricator.wikimedia.org/T177458) [13:59:27] (03PS3) 10Elukey: profile::hive::server/metastore: add Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/395736 (https://phabricator.wikimedia.org/T177458) [13:59:50] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3816024 (10EddieGP) >>! In T169450#3815834, @MarcoAurelio wrote: > @EddieGP Is it possible to avoid a "phantom wiki" and not messi... [13:59:59] @jouncebot next [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T1400). [14:00:04] Hauskatze and Addshore: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] hah [14:00:17] o/ [14:00:20] \o [14:00:27] meow [14:00:31] o/ [14:00:31] rawr [14:00:34] feed me with patches [14:00:39] I can swat today [14:00:49] addshore: want to deploy your changes first? 
[14:00:58] * Hauskatze groans [14:01:02] I'll quickly do my config ones yes :) [14:01:15] (03CR) 10Addshore: [C: 032] Enable AdvancedSearch on fawiki and huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395713 (https://phabricator.wikimedia.org/T181498) (owner: 10Addshore) [14:01:15] ( j/k I'm fine with that ) [14:03:30] (03PS1) 10Elukey: oozie::server: replace heap_size with a more generic jvm_opts [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395740 (https://phabricator.wikimedia.org/T177458) [14:03:46] (03CR) 10ArielGlenn: [C: 032] Update dumpwikidatajson & dumpwikidatardf to not use Wikidata build [puppet] - 10https://gerrit.wikimedia.org/r/395701 (https://phabricator.wikimedia.org/T182175) (owner: 10Addshore) [14:03:58] (03PS2) 10ArielGlenn: Update dumpwikidatajson & dumpwikidatardf to not use Wikidata build [puppet] - 10https://gerrit.wikimedia.org/r/395701 (https://phabricator.wikimedia.org/T182175) (owner: 10Addshore) [14:04:25] (03PS2) 10Elukey: oozie::server: replace heap_size with a more generic jvm_opts [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395740 (https://phabricator.wikimedia.org/T177458) [14:06:03] (03Merged) 10jenkins-bot: Enable AdvancedSearch on fawiki and huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395713 (https://phabricator.wikimedia.org/T181498) (owner: 10Addshore) [14:06:53] (03CR) 10jenkins-bot: Enable AdvancedSearch on fawiki and huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395713 (https://phabricator.wikimedia.org/T181498) (owner: 10Addshore) [14:08:03] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT T181498 T181329 Enable AdvancedSearch on fawiki and huwiki (duration: 00m 50s) [14:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:14] T181329: AdvancedSearch on Hungarian Wikipedia - https://phabricator.wikimedia.org/T181329 [14:08:14] T181498: Enable advanced search in Persian Wikipedia - 
https://phabricator.wikimedia.org/T181498 [14:08:17] (03CR) 10Addshore: [C: 032] wmgUseNewWikiDiff2Extension true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395714 (https://phabricator.wikimedia.org/T180476) (owner: 10Addshore) [14:09:41] (03Merged) 10jenkins-bot: wmgUseNewWikiDiff2Extension true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395714 (https://phabricator.wikimedia.org/T180476) (owner: 10Addshore) [14:09:47] (03CR) 10Zfilipin: [C: 031] Add Portal namespace for mwl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052) (owner: 10MarcoAurelio) [14:10:03] (03CR) 10jenkins-bot: wmgUseNewWikiDiff2Extension true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395714 (https://phabricator.wikimedia.org/T180476) (owner: 10Addshore) [14:10:16] (03PS4) 10Elukey: profile::hive::server/metastore: add Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/395736 (https://phabricator.wikimedia.org/T177458) [14:10:46] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3816074 (10MarcoAurelio) @EddieGP I guess 1 was https://gerrit.wikimedia.org/r/#/c/393289/ and 2 is https://gerrit.wikimedia.org/r... [14:10:57] (03CR) 10Zfilipin: [C: 031] "Script to run?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052) (owner: 10MarcoAurelio) [14:11:23] (03PS4) 10Zfilipin: Add Portal namespace for mwl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052) (owner: 10MarcoAurelio) [14:12:14] Hauskatze: this is the correct script for 394528? 
mwscript namespaceDupes.php mwlwiki --fix [14:12:24] zeljkof, yes [14:12:42] ok, thanks, please stand by, you are next, as soon as addshore is done [14:12:43] syncing my last one now [14:13:07] zeljkof, this is also because we've changed several namespaces there recently so it'd be good to clean after them [14:13:10] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT T180476 wmgUseNewWikiDiff2Extension true for dewiki (duration: 00m 48s) [14:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:20] T180476: Enable the showing of changes in moved paragraphs on de-wiki - https://phabricator.wikimedia.org/T180476 [14:13:22] zeljkof: thats me done [14:13:25] zeljkof, please dry-run first, then see if there's anything to fix [14:13:37] after merging the patch :) [14:13:38] addshore: thanks, taking over swat [14:13:43] Hauskatze: will do [14:13:48] * Hauskatze shuts t.f.u. now [14:14:03] (03PS5) 10Elukey: profile::hive::server/metastore: add Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/395736 (https://phabricator.wikimedia.org/T177458) [14:14:36] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3816086 (10Addshore) [14:15:07] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052) (owner: 10MarcoAurelio) [14:15:43] Hauskatze: I don't see dryrun option at https://www.mediawiki.org/wiki/Manual:NamespaceDupes.php [14:15:56] zeljkof, without --fix [14:16:08] Hauskatze: ah :) [14:16:10] but after the change is sync [14:16:21] otherwise there's no point I think? 
[14:16:22] sure, waiting for CI [14:16:27] (03Merged) 10jenkins-bot: Add Portal namespace for mwl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052) (owner: 10MarcoAurelio) [14:16:39] :) [14:16:57] (03CR) 10jenkins-bot: Add Portal namespace for mwl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394528 (https://phabricator.wikimedia.org/T180052) (owner: 10MarcoAurelio) [14:17:43] Hauskatze: so I need to do a full deploy, then run the script? or just deployment to mwdebug1002 and then running the script? [14:17:55] I'd say a full deploy. [14:18:02] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9203/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/395736 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [14:18:10] I can check the Portal one in mwdebug though [14:18:16] if it's correctly added, etc [14:19:45] Hauskatze: ok, 394528 is at mwdebug1002, please check [14:19:47] once Portal is on mwdebug pls let me know [14:19:49] heh [14:19:55] if it looks good I will deploy and run the script [14:20:46] yep, I can see Portal and Cumbersa_portal [14:20:58] ok to deploy? [14:21:04] ok from me [14:21:18] deploying... [14:21:41] (03CR) 10Elukey: [C: 032] profile::hive::server/metastore: add Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/395736 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [14:22:20] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:394528|Add Portal namespace for mwl.wikipedia (T180052)]] (duration: 00m 48s) [14:22:24] zeljkof: I have 2 more that I'll do after you! 
[14:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:30] #sneaky [14:22:32] T180052: Creation and update of namespaces in Mirandese Wikipedia (mwlwiki) - https://phabricator.wikimedia.org/T180052 [14:22:33] Hauskatze: deployed, running the script [14:22:36] addshore: :) [14:22:56] zeljkof, thanks; please let me know if there are conflicts [14:23:12] Hauskatze: there are, pasting the output to task [14:23:20] thanks [14:23:46] addshore: that change is merged but it won't be active until the cron jobs on the host are re-enabled, which should be for next week's run (this is to do with the memory leak issue) [14:24:02] apergos: okay! [14:24:15] Hauskatze: https://phabricator.wikimedia.org/T180052#3816108 [14:24:38] let me know if I should run the script with --fix, or if you need to fix the conflicts first [14:24:51] (03CR) 10Elukey: [C: 032] oozie::server: replace heap_size with a more generic jvm_opts [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395740 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [14:25:33] zeljkof, I'll check with dereckson if possible, but for now we can move to the next patch [14:25:52] Hauskatze: should I do a full deploy of the first one first? [14:26:21] zeljkof, the last one I cannot really test it so I guess a full deploy it's possible [14:26:51] Hauskatze: sorry, did not understand you [14:27:04] zeljkof, the $wgCopyUploads one [14:27:09] you can deploy that [14:27:11] my question was, should I do a full deploy of 394528? 
[14:27:31] before merging 394379 [14:27:31] ah, yes sure [14:27:39] though you had already [14:27:44] Hauskatze: ok, deploying, let me know when to run the script [14:27:55] [2017-12-06T14:22:20Z] Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:394528|Add Portal namespace for mwl.wikipedia (T180052)]] (duration: 00m 48s) [14:27:55] T180052: Creation and update of namespaces in Mirandese Wikipedia (mwlwiki) - https://phabricator.wikimedia.org/T180052 [14:28:39] Hauskatze: ah, sorry, got confused, I _did_ already deploy 394528, nevermind me... [14:28:58] we'll fix the conflicts later [14:29:05] now the *79 [14:29:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713) (owner: 10MarcoAurelio) [14:30:29] (03PS4) 10Zfilipin: Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713) (owner: 10MarcoAurelio) [14:30:37] (03CR) 10Zfilipin: Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713) (owner: 10MarcoAurelio) [14:30:37] (03CR) 10Zfilipin: [C: 032] Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713) (owner: 10MarcoAurelio) [14:31:38] Hauskatze: ok, so you can not test 394379, so I should do a full deploy immediately? 
[14:31:58] (03Merged) 10jenkins-bot: Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713) (owner: 10MarcoAurelio) [14:31:58] zeljkof, that's correct; I'll ask the requestor to confirm [14:32:11] (03CR) 10jenkins-bot: Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713) (owner: 10MarcoAurelio) [14:33:17] (03PS1) 10Elukey: profile::oozie::server: add Prometheus monitoring and tune jvm [puppet] - 10https://gerrit.wikimedia.org/r/395741 (https://phabricator.wikimedia.org/T177458) [14:33:26] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:394379|Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains (T181713)]] (duration: 00m 47s) [14:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:37] T181713: Please add https//studiezaal.nijmegen.nl to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T181713 [14:34:13] Hauskatze: 394379 is deployed; I do not need to run the script for 394528 with --fix yet? 
[14:35:18] 10Operations, 10Puppet, 10Wikidata, 10User-Addshore: operations-puppet still uses the wikidata build in places - https://phabricator.wikimedia.org/T182176#3816120 (10Addshore) [14:35:21] 10Operations, 10Puppet, 10Wikidata, 10Patch-For-Review, 10User-Addshore: dumpwikidatajson.sh and dumpwikidatardf.sh still use the wikidata build in puppet - https://phabricator.wikimedia.org/T182175#3816119 (10Addshore) 05Open>03Resolved [14:35:22] zeljkof, there are some pages that output as "invalid" that I'd like to have a look, but you can --fix I guess and I'll work with the non-automatic-fixable ones [14:35:46] PROBLEM - MD RAID on mw1260 is CRITICAL: Return code of 255 is out of bounds [14:35:55] Hauskatze: ok, running the script [14:36:02] thanks zeljkof [14:36:47] zeljkof: ping me when i can continue my bits please :) [14:36:57] addshore: go go go! :D [14:37:02] wheee thanks [14:37:27] Hauskatze: https://phabricator.wikimedia.org/T180052#3816122 and thanks for deploying with #releng! ;) [14:37:28] * addshore waits for CI [14:37:44] addshore: feel free to close the window when done [14:37:49] will do [14:38:24] (03PS2) 10Elukey: profile::oozie::server: add Prometheus monitoring and tune jvm [puppet] - 10https://gerrit.wikimedia.org/r/395741 (https://phabricator.wikimedia.org/T177458) [14:38:39] zeljkof, one more thing, can we run it **without** --fix this time to see how many do we have left to fix? [14:38:54] Hauskatze: sure [14:38:57] so only the conflicting ones appear [14:39:00] thanks and sorry [14:39:07] no problem [14:39:12] I should have thought of that :) [14:39:20] (people forget about running this script so often) [14:40:03] Hauskatze: here it is https://phabricator.wikimedia.org/T180052#3816125 [14:40:16] thanks! 
[14:40:20] 5 and 14 [14:40:24] that's much better [14:40:34] now I can check those and fix them on-wiki [14:41:54] (03PS3) 10Elukey: profile::oozie::server: add Prometheus monitoring and tune jvm [puppet] - 10https://gerrit.wikimedia.org/r/395741 (https://phabricator.wikimedia.org/T177458) [14:42:45] PROBLEM - mediawiki-installation DSH group on mw1260 is CRITICAL: Host mw1260 is not in mediawiki-installation dsh group [14:43:52] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9207/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/395741 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [14:44:25] *waits* [14:45:54] (03PS4) 10ArielGlenn: move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) [14:48:35] PROBLEM - Oozie Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap [14:48:46] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:49:27] (03PS1) 10Elukey: profile::oozie::monitoring: add missing configuration file. [puppet] - 10https://gerrit.wikimedia.org/r/395742 (https://phabricator.wikimedia.org/T177458) [14:49:29] checking oozie.. it's my fault [14:49:43] (03CR) 10Elukey: [C: 032] profile::oozie::monitoring: add missing configuration file. [puppet] - 10https://gerrit.wikimedia.org/r/395742 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [14:50:25] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/oozie/prometheus_oozie_server_jmx_exporter.yaml] [14:53:50] !log addshore@tin Synchronized php-1.31.0-wmf.10/extensions/Wikibase/repo/includes/ChangeDispatcher.php: SWAT [[gerrit:395737|Tracking within ChangeDispatcher::getPendingChanges]] (duration: 00m 49s) [14:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:19] zeljkof, I've fixed the 5 conflicts, the other seems to be pagelinks, I've asked around [14:54:32] wiki is working fine so no worries [14:55:23] (03PS5) 10ArielGlenn: move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) [14:55:25] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:57:30] last one syncing [14:57:56] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 2 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue [14:57:58] !log addshore@tin Synchronized php-1.31.0-wmf.11/extensions/Wikibase/repo/includes/ChangeDispatcher.php: SWAT [[gerrit:395738|Tracking within ChangeDispatcher::getPendingChanges]] (duration: 00m 49s) [14:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:58] Hauskatze: cool [14:59:17] !log swat done [14:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:56] RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue [15:01:15] PROBLEM - Check systemd state on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:01:16] PROBLEM - dhclient process on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:01:16] PROBLEM - Disk space on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:01:26] PROBLEM - DPKG on mw1260 is CRITICAL: Return code of 255 is out of bounds 
[15:01:26] PROBLEM - nutcracker process on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:01:26] PROBLEM - configured eth on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:01:26] PROBLEM - Check size of conntrack table on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:01:26] PROBLEM - nutcracker port on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:01:26] PROBLEM - Check whether ferm is active by checking the default input chain on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:02:03] moritzm: FYI ^^^ [15:02:05] PROBLEM - MD RAID on mw1260 is CRITICAL: Return code of 255 is out of bounds [15:03:05] RECOVERY - MD RAID on mw1260 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:03:15] RECOVERY - Disk space on mw1260 is OK: DISK OK [15:03:16] RECOVERY - dhclient process on mw1260 is OK: PROCS OK: 0 processes with command name dhclient [15:03:26] RECOVERY - Check size of conntrack table on mw1260 is OK: OK: nf_conntrack is 0 % full [15:03:26] RECOVERY - configured eth on mw1260 is OK: OK - interfaces up [15:03:26] RECOVERY - Check whether ferm is active by checking the default input chain on mw1260 is OK: OK ferm input default policy is set [15:03:56] (03PS6) 10ArielGlenn: move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) [15:05:26] RECOVERY - DPKG on mw1260 is OK: All packages OK [15:06:14] yep, it's being reimaged, silencing [15:06:52] (03PS2) 10Dzahn: Make wikidatawiki cronjobs use the Wikibase extension and not the build [puppet] - 10https://gerrit.wikimedia.org/r/395699 (https://phabricator.wikimedia.org/T182159) (owner: 10Addshore) [15:07:06] hi mutante! 
[15:07:15] im ready to watch that one go out if you merge it :) [15:07:22] (03CR) 10Dzahn: [C: 032] "important because it was set to UBN" [puppet] - 10https://gerrit.wikimedia.org/r/395699 (https://phabricator.wikimedia.org/T182159) (owner: 10Addshore) [15:08:35] RECOVERY - Oozie Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap [15:10:21] addshore: hi! yes, just changed it on terbium.. done [15:10:45] thanks! [15:11:11] (03CR) 10Alexandros Kosiaris: [C: 032] Update to 1.7.10 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/393805 (https://phabricator.wikimedia.org/T181489) (owner: 10Alexandros Kosiaris) [15:11:37] 10Operations, 10Puppet, 10Wikidata, 10Patch-For-Review, 10User-Addshore: dispatchChanges.php still being run from the Wikidata build - https://phabricator.wikimedia.org/T182159#3816187 (10Dzahn) applied on terbium ``` [terbium:~] $ sudo crontab -u www-data -l | grep wikidata 0,15,30,45 * * * * /usr/loca... [15:11:45] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:12:21] mutante: script started just fine, thanks! 
[15:12:34] 10Operations, 10Puppet, 10Wikidata, 10User-Addshore: operations-puppet still uses the wikidata build in places - https://phabricator.wikimedia.org/T182176#3816190 (10Addshore) [15:12:38] 10Operations, 10Puppet, 10Wikidata, 10Patch-For-Review, 10User-Addshore: dispatchChanges.php still being run from the Wikidata build - https://phabricator.wikimedia.org/T182159#3816189 (10Addshore) 05Open>03Resolved [15:12:41] 10Operations, 10Puppet, 10Wikidata, 10User-Addshore: operations-puppet still uses the wikidata build in places - https://phabricator.wikimedia.org/T182176#3815207 (10Addshore) 05Open>03Resolved [15:13:02] !log upload kubernetes_1.7.10-1_amd64 on apt.wikimedia.org/stretch-wikimedia/main T181489 [15:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:14] T181489: Gaps in kubelet-reported Prometheus metrics - https://phabricator.wikimedia.org/T181489 [15:13:17] (03CR) 10Dzahn: [C: 04-1] "per Paladox' comment this would just affect Gerrit 2.14 but not the current version, afaict" [puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) (owner: 10MarcoAurelio) [15:13:22] addshore: nice :) [15:13:39] (03PS1) 10Elukey: role::analytics_cluster::coordinator: remove Oozie Prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/395745 (https://phabricator.wikimedia.org/T177458) [15:13:56] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time [15:14:02] (03CR) 10Dzahn: [C: 04-1] "unless we are upgrading soon anyways and it's good enough for you when it is changed there" [puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) (owner: 10MarcoAurelio) [15:14:19] (03PS2) 10Elukey: role::analytics_cluster::coordinator: remove Oozie Prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/395745 (https://phabricator.wikimedia.org/T177458) [15:14:57] 
RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 1.957 second response time [15:14:57] (03CR) 10Elukey: [C: 032] role::analytics_cluster::coordinator: remove Oozie Prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/395745 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [15:15:08] (03CR) 10Dzahn: [C: 032] contint: libcurl4-gnutls-dev is now absent [puppet] - 10https://gerrit.wikimedia.org/r/392928 (owner: 10Hashar) [15:16:34] 10Operations, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create ircd exporter for Prometheus - https://phabricator.wikimedia.org/T182196#3816206 (10MoritzMuehlenhoff) p:05Triage>03High [15:17:37] (03PS1) 10Elukey: oozie::server: avoid auto-restarts when config files change [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395746 [15:22:03] (03PS2) 10Elukey: oozie::server: avoid auto-restarts when config files change [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395746 [15:23:22] (03CR) 10Ottomata: [V: 032 C: 032] oozie::server: avoid auto-restarts when config files change [puppet/cdh] - 10https://gerrit.wikimedia.org/r/395746 (owner: 10Elukey) [15:24:26] RECOVERY - nutcracker port on mw1260 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [15:24:26] RECOVERY - nutcracker process on mw1260 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [15:25:35] (03PS1) 10Alexandros Kosiaris: Pass enable=>true for all kubernetes components [puppet] - 10https://gerrit.wikimedia.org/r/395747 [15:26:54] (03PS2) 10Alexandros Kosiaris: Pass enable=>true for all kubernetes components [puppet] - 10https://gerrit.wikimedia.org/r/395747 [15:27:42] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Update location of wikibase-rebuildTermSqlIndex script [puppet] - 10https://gerrit.wikimedia.org/r/395700 (owner: 10Addshore) [15:27:59] (03PS1) 10Elukey: modules::cdh: update to latest change [puppet] - 10https://gerrit.wikimedia.org/r/395748 
[15:28:06] PROBLEM - Check systemd state on kubestage1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:28:06] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:29:01] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 (owner: 10Addshore) [15:30:04] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Switch to extension.json for PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395486 (owner: 10Addshore) [15:30:20] (03CR) 10Alexandros Kosiaris: [C: 032] Pass enable=>true for all kubernetes components [puppet] - 10https://gerrit.wikimedia.org/r/395747 (owner: 10Alexandros Kosiaris) [15:30:43] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Switch to extension.json for WikibaseQuality extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395487 (owner: 10Addshore) [15:31:05] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/9212/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/395748 (owner: 10Elukey) [15:31:12] (03PS2) 10Elukey: modules::cdh: update to latest change [puppet] - 10https://gerrit.wikimedia.org/r/395748 [15:31:19] * elukey got +2 sniped [15:33:06] RECOVERY - Check systemd state on kubestage1002 is OK: OK - running: The system is fully operational [15:33:06] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational [15:38:45] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [15:39:05] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error [15:39:15] PROBLEM - Check systemd state on kubernetes1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more 
units failed. [15:39:55] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:40:05] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [15:40:15] RECOVERY - Check systemd state on kubernetes1002 is OK: OK - running: The system is fully operational [15:40:45] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [15:40:55] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [15:42:50] (03PS1) 10Muehlenhoff: Add .gitreview file [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395749 [15:42:55] RECOVERY - Check systemd state on mw1260 is OK: OK - running: The system is fully operational [15:43:20] (03PS2) 10Tpt: Properly setup ProofreadPage namespaces for cywikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394189 (https://phabricator.wikimedia.org/T181406) [15:44:43] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add .gitreview file [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395749 (owner: 10Muehlenhoff) [15:46:23] (03PS1) 10Muehlenhoff: Add a prometheus exporter for ircd [debs/prometheus-ircd-exporter] - 10https://gerrit.wikimedia.org/r/395751 [15:47:17] (03PS3) 10Aklapper: Fix linewrap issue on wikimedia error page [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [15:47:48] (03CR) 10Aklapper: [C: 031] "Thanks! Tested locally and works as expected. 
(I don't have +2 rights on this repository.)" [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [15:48:51] (03PS2) 10Ema: vcl: add hostname/layer info to syntethic healthcheck response [puppet] - 10https://gerrit.wikimedia.org/r/393251 [15:49:24] 10Operations, 10RESTBase-Cassandra, 10Services (later): cassandra slow streaming during (de)commission - https://phabricator.wikimedia.org/T126619#3816371 (10Eevans) Some additional information: [[ https://issues.apache.org/jira/browse/CASSANDRA-4663 | CASSANDRA-4663 ]] adds concurrency at the keyspace leve... [15:51:15] bah, why do we clean up branches for core and extensions [15:51:18] thats super annoying [15:51:36] rm -rf * [15:51:42] (03PS1) 10Muehlenhoff: Add Prometheus exporter to role::mw_rc_irc [puppet] - 10https://gerrit.wikimedia.org/r/395766 (https://phabricator.wikimedia.org/T182196) [15:55:04] (03PS1) 10Muehlenhoff: Add Prometheus scraper config for ircd exporter [puppet] - 10https://gerrit.wikimedia.org/r/395767 (https://phabricator.wikimedia.org/T182196) [16:00:05] anomie: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Maintenance script . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T1600). [16:00:06] No GERRIT patches in the queue for this window AFAICS. [16:00:32] (03PS7) 10ArielGlenn: move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) [16:00:35] That's right, jouncebot, it's a window for running a maintenance script. 
[16:00:56] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3816414 (10akosiaris) [16:01:39] 10Operations, 10ORES, 10Scoring-platform-team, 10Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3802542 (10akosiaris) [16:01:40] (03CR) 10ArielGlenn: [C: 032] move adds-changes (so-called incrementals) dump cron to dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/395612 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [16:02:21] (03PS4) 10Gehel: Revert "Revert "Deploy MjoLniR with new deploy repository"" [puppet] - 10https://gerrit.wikimedia.org/r/394002 (owner: 10EBernhardson) [16:03:11] !log anomie@terbium Running cleanupUsersWithNoId.php for testwiki, see T181731 [16:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:21] T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731 [16:03:55] (03PS4) 10Dzahn: Move contint::hhvm to a profile [puppet] - 10https://gerrit.wikimedia.org/r/392925 (owner: 10Hashar) [16:05:16] (03CR) 10Dzahn: [C: 032] Move contint::hhvm to a profile [puppet] - 10https://gerrit.wikimedia.org/r/392925 (owner: 10Hashar) [16:05:49] (03PS3) 10Dzahn: contint: libcurl4-gnutls-dev is now absent [puppet] - 10https://gerrit.wikimedia.org/r/392928 (owner: 10Hashar) [16:11:49] !log anomie@terbium Finished cleanupUsersWithNoId.php on testwiki [16:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:38] !log anomie@terbium Running cleanupUsersWithNoId.php for test2wiki, see T181731 [16:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:49] T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731 [16:16:01] 10Operations, 10ORES, 
10Scoring-platform-team: Tuning profile::ores::celery parameters should cause a Celery service restart - https://phabricator.wikimedia.org/T182203#3816483 (10awight) [16:26:28] !log anomie@terbium Finished cleanupUsersWithNoId.php on test2wiki [16:26:29] (03PS1) 10MarcoAurelio: Allow eswiki bureaucrats to add/remove 'accountcreator' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395775 (https://phabricator.wikimedia.org/T182201) [16:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:00] !log anomie@terbium Running cleanupUsersWithNoId.php for testwikidatawiki, see T181731 [16:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:09] T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731 [16:28:36] !log anomie@terbium Finished cleanupUsersWithNoId.php on testwikidatawiki [16:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:37] !log anomie@terbium Running cleanupUsersWithNoId.php for mediawikiwiki, see T181731 [16:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:12] anomie, is class CleanupUsersWithNoId extends LoggedUpdateMaintenance { what makes the bot to log both the start and the end of the running of this script? [16:32:35] Hauskatze: No, I'm doing that manually. [16:32:55] ah, 'cause I saw logmsgbot saying that [16:33:02] no idea you could speak for the bot :) [16:33:32] On terbium, you can do "scap log 'some message here'" and the bot will report it to this channel for the other bot to react to. [16:33:52] cool [16:33:53] !log anomie@terbium Finished cleanupUsersWithNoId.php on mediawikiwiki [16:33:54] thanks [16:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:43] And that's it for cleanupUsersWithNoId.php for today. 
If any of those four wikis start raising errors about usernames containing the invalid ">" character, let me know. Assuming no error reports, I'll be running all the other wikis next week (without the logspam). [16:35:45] (03PS5) 10Gehel: Revert "Revert "Deploy MjoLniR with new deploy repository"" [puppet] - 10https://gerrit.wikimedia.org/r/394002 (owner: 10EBernhardson) [16:37:17] (03CR) 10Gehel: [C: 032] Revert "Revert "Deploy MjoLniR with new deploy repository"" [puppet] - 10https://gerrit.wikimedia.org/r/394002 (owner: 10EBernhardson) [16:44:10] (03PS2) 10Gehel: logstash: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390433 (https://phabricator.wikimedia.org/T179964) [16:48:53] @jouncebot now [16:49:14] jouncebot: now [16:49:14] For the next 0 hour(s) and 10 minute(s): Maintenance script (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T1600) [16:49:38] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/390433 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [16:52:19] anomie, apparently scap log on -prep does nothing :) [16:52:32] (deployment-tin I mean) [16:53:01] Hauskatze: Yeah, it does nothing in Beta Cluster. You have to go to #wikimedia-cloud and type the "!log" yourself. [16:53:01] cleaned up some spam while I was at it [16:53:35] (03CR) 10Elukey: [V: 032 C: 032] Add support for Real Time ingestion metrics. 
[software/druid_exporter] - 10https://gerrit.wikimedia.org/r/395727 (owner: 10Elukey) [16:53:47] btw: lol: https://deployment.wikimedia.beta.wmflabs.org/w/index.php?title=Special:Log/block&page=User%3AAddshore [16:54:13] (03CR) 10Elukey: [V: 032 C: 032] Remove unsed parameter --allowed-daemons [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/395728 (owner: 10Elukey) [17:00:46] bawolff, spambots at beta cluster are having their emails verified :/ [17:01:44] :/ [17:01:50] Well its not that hard to verify an email [17:03:53] bawolff, can I PM [17:03:55] ? [17:04:07] If you want [17:04:10] thanks [17:05:26] !log Started Wikidata RDF dumps (sudo -b -u datasets bash -c 'dumpwikidatardf.sh all ttl; dumpwikidatardf.sh truthy nt') on snapshot1007 [17:05:30] apergos: FYI ^ [17:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:06] hoo: did you check the location of the... [17:06:09] um [17:06:28] location? [17:06:43] https://gerrit.wikimedia.org/r/#/c/395701/ this is merged but not deployed [17:07:01] and the reason it's not deployed is that the cron jobs are commented out on the snapshot host (in the puppet manifest) [17:07:17] until you revert it and poke me, presumably after this run completes [17:07:39] so just to make sure that the script is actually running ok? I guess that there will be a deploy later tonight to remove... [17:07:49] some large directory tree? [17:07:57] addshore: help me out here :-D [17:08:05] *reads up* [17:08:22] Yeah, we're indeed running extensions/Wikidata/extensions/Wikibase/repo/maintenance/dumpRdf.php [17:09:07] does any of that code re-open php files? if it's read into memory then you're ok, it will be unlinked but survive til the end of the run [17:09:24] but if it tries to read any of the scripts etc later, then... 
ewww [17:10:14] This should be fine… we have everything in memory now, I suppose [17:10:31] :D [17:10:42] (03CR) 10Gehel: [C: 032] logstash: dedicated components in our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/390433 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [17:10:49] ok great [17:11:14] I will probably be around only sporadically tonight fyi [17:11:37] this should be fine… worst case I just kill this and we defer the creation for a day [17:12:42] k [17:13:35] (03CR) 10Chad: "It's on my radar." [puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) (owner: 10MarcoAurelio) [17:23:09] 10Operations, 10Wikimedia-Logstash, 10hardware-requests: decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830#3816743 (10Gehel) [17:27:18] (03CR) 10Hoo man: "> do they all run as the same user? does that user run other stuff? thinking of "killall -u username" for this" [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [17:30:04] mobrovac and Pchelolo: Time to snap out of that daydream and deploy JobQueue Job Migration. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T1730). [17:30:04] No GERRIT patches in the queue for this window AFAICS. 
[17:30:42] (03CR) 10Hoo man: Fix killing dumpers in Wikidata entity dumpers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [17:32:08] 10Operations: install_server: switch to stretch as default install image - https://phabricator.wikimedia.org/T182215#3816772 (10Dzahn) [17:34:14] (03PS2) 10Hoo man: Fix killing dumpers in Wikidata entity dumpers [puppet] - 10https://gerrit.wikimedia.org/r/393923 [17:34:18] 10Operations: install_server: switch to stretch as default install image - https://phabricator.wikimedia.org/T182215#3816788 (10Dzahn) p:05Triage>03Normal [17:34:34] (03PS3) 10Mobrovac: Disable producing htmlCacheUpdate to redis for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395616 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [17:34:48] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3816790 (10madhuvishy) @elukey I recommend copying home directories on notebook1002 and back them up somewhere on notebook1001, and send a note to analytics and research-l... 
[17:36:06] (03CR) 10Hoo man: "Note: The new version is tested as well" [puppet] - 10https://gerrit.wikimedia.org/r/393923 (owner: 10Hoo man) [17:36:21] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1260.eqiad.wmnet [17:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:35] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3816809 (10Halfak) [17:37:25] 10Operations, 10ORES, 10Scoring-platform-team, 10Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3816812 (10Halfak) [17:37:39] @jouncebot now [17:37:48] jouncebot: why u so quiet [17:37:56] @jouncebot next [17:38:10] <_joe_> jouncebot: next [17:38:10] In 1 hour(s) and 21 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T1900) [17:38:15] <_joe_> without the @ [17:38:20] * addshore facepalms [17:38:23] <_joe_> this ain't twitter :D [17:38:30] for some reason I think I am in phabricator or twitter... [17:38:37] (03CR) 10Hashar: "Nodepool images are adjusted via https://gerrit.wikimedia.org/r/#/c/392926/" [puppet] - 10https://gerrit.wikimedia.org/r/392925 (owner: 10Hashar) [17:38:44] <_joe_> addshore: don't you *hate* they're doing this to us? [17:38:45] <_joe_> :P [17:38:49] yup [17:38:53] my muscle memory is all screwed up [17:39:10] <_joe_> slack has the same issue too [17:39:17] Among others :p [17:39:20] <_joe_> but I guess you've been lucky enough not to use it [17:39:26] jouncebot next [17:39:27] In 1 hour(s) and 20 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T1900) [17:39:50] I thought the : mattered. 
[17:39:57] * addshore has 2 tiny patches he wants to sneak through [17:40:23] * addshore also seems to type /mw lots instead of /me nowerdayysss [17:42:01] !log ppchelko@tin Started deploy [cpjobqueue/deploy@3281df1]: Switch htmlCacheUpdate for wiktionaries T182023 [17:42:08] <_joe_> \o/ [17:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:11] T182023: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023 [17:42:45] RECOVERY - mediawiki-installation DSH group on mw1260 is OK: OK [17:44:35] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:44:58] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@3281df1]: Switch htmlCacheUpdate for wiktionaries T182023 (duration: 02m 57s) [17:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:13] ^^ my fault, fixing [17:50:58] !log ppchelko@tin Started deploy [cpjobqueue/deploy@df72b34]: Switch htmlCacheUpdate for wiktionaries, attempt 2 T182023 [17:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:08] T182023: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023 [17:51:30] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@df72b34]: Switch htmlCacheUpdate for wiktionaries, attempt 2 T182023 (duration: 00m 32s) [17:51:36] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [17:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:49] (03CR) 10Dzahn: [C: 04-1] "yea, that's true, i will amend and use 2 separate groups to address this issue and keep the commands restricted to the right hosts" [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) (owner: 10Dzahn) [17:59:53] oh come on, there is patch merged in mw-config but not deployed [18:00:06] 10Operations, 
10ORES, 10Scoring-platform-team, 10Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3816966 (10mmodell) I'm not sure what to make of this one. I don't think T179013 ever affected production, so I'm not sure... [18:00:27] zeljkof: ping? [18:00:36] zeljkof: you merged https://gerrit.wikimedia.org/r/#/c/394379/ but it doesn't seem to have been deployed [18:00:49] mobrovac: really? [18:00:53] * zeljkof is looking [18:01:09] git log ..origin/master gives me the diff as being this patch [18:01:15] sorry about that, I thought I had deployed it :| [18:01:26] zeljkof: could you sync it now please? [18:01:35] mobrovac: sure, just a second [18:01:39] kk [18:03:01] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3816990 (10mmodell) I'd like to push the latest scap code to production this week if I can get an opsen to upload the package. I'll create a... [18:03:27] mobrovac: argh, long day, according to my history, I did run scap, but I forgot git pull/rebase :( [18:03:30] doing it now [18:04:01] zeljkof: k, just please do it quickly, i need to go out with a patch asap :) [18:04:06] doing it [18:04:50] hmm, that patch was merged right? [18:04:51] deploying...
[18:05:08] Hauskatze: yes, but looks like I forgot to git fetch/rebase on tin :( [18:05:27] (03CR) 10Mobrovac: [C: 032] Disable producing htmlCacheUpdate to redis for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395616 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [18:05:28] I did run scap, but that does not do anything if the code is not on tin [18:05:34] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:394379|Add https://studiezaal.nijmegen.nl to $wgCopyUploadsDomains (T181713)]] (duration: 00m 49s) [18:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:45] T181713: Please add https//studiezaal.nijmegen.nl to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T181713 [18:05:47] mobrovac, Hauskatze: deployed ^ [18:05:51] gr8 thnx [18:06:00] :) [18:06:03] mobrovac: sorry about that [18:06:18] no worries, thnx for fixing it quickly zeljkof [18:08:11] (03Merged) 10jenkins-bot: Disable producing htmlCacheUpdate to redis for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395616 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [18:08:22] (03CR) 10jenkins-bot: Disable producing htmlCacheUpdate to redis for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395616 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [18:10:02] (03PS1) 10Herron: puppet: set rhodium puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/395791 (https://phabricator.wikimedia.org/T177254) [18:12:10] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch htmlCacheUpdate jobs for wiktionaries to EventBus, file 1/2 - T182023 (duration: 00m 48s) [18:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:21] T182023: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023 [18:13:19] !log mobrovac@tin Synchronized 
wmf-config/jobqueue.php: Switch htmlCacheUpdate jobs for wiktionaries to EventBus, file 2/2 - T182023 (duration: 00m 48s) [18:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:54] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/9213/rhodium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/395791 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [18:16:07] !log upgrading rhodium to puppet 4 [18:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:49] (03CR) 10Herron: [C: 032] puppet: set rhodium puppet_major_version to 4 [puppet] - 10https://gerrit.wikimedia.org/r/395791 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [18:17:17] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3817046 (10mobrovac) [18:26:06] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [18:27:06] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 331 bytes in 0.037 second response time [18:29:29] Did anyone notice the two icinga-wm bots in here? 
[18:32:40] !log mobrovac@tin Started deploy [zotero/translators@3044b3a]: Update translators to 092c7bc - T178596 [18:32:48] !log mobrovac@tin Finished deploy [zotero/translators@3044b3a]: Update translators to 092c7bc - T178596 (duration: 00m 07s) [18:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:50] T178596: Update zotero translators on gerrit from the zotero repository on github - https://phabricator.wikimedia.org/T178596 [18:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:09] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817115 (10awight) [18:33:12] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3817113 (10awight) 05Open>03Resolved This can be resolved for now, it's a facet of the Celery 4 work. [18:35:01] (03PS4) 10MarcoAurelio: gerritbot: bolden `merged` on Phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) [18:35:12] (03PS5) 10MarcoAurelio: gerritbot: bolden `merged` on Phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) [18:35:48] (03CR) 10MarcoAurelio: "@Paladox, Chad and Dzhan: modified PatchSetMerged.vm as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) (owner: 10MarcoAurelio) [18:37:23] (03PS7) 10MarcoAurelio: Extension:Translate default permissions for Wikimedia wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/385953 (https://phabricator.wikimedia.org/T178793) [18:44:53] !log stopped irc echo on tegmen, re-enabled puppet and run it (it was disabled by a run-no-puppet sync_icinga_state) [18:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:03] thanks Zppix ;) [18:46:47] volans: np [18:50:38] (03CR) 10Cmjohnson: [C: 032] Adding mgmt and production dns entries for db111[1-4] [dns] - 10https://gerrit.wikimedia.org/r/395574 (owner: 10Cmjohnson) [18:53:55] volans: busy? [18:54:34] hey urandom, not really but starts to be late here ;) what's up? [18:55:00] i wanted to bounce something off you, but it's not super urgent, could wait until tomorrow [18:55:27] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3817185 (10awight) [18:55:29] what is it? 
:) [18:55:53] 10Operations: run-no-puppet leave puppet disabled on kill/crash - https://phabricator.wikimedia.org/T182228#3817194 (10Volans) [18:55:55] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3560572 (10awight) [18:56:25] i have one host in a cluster, a cluster of machines which should be (nearly) identical, that has abnormal disk utilization, and high latency (presumably as a result) [18:56:31] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817196 (10awight) [18:56:46] i could use a pair of fresh eyes [18:57:05] if hw raid, did you check RAID policy? [18:57:09] which host? [18:57:14] fresh in the sense that they have not already been looking at this, but perhaps fresh in the sense that it is not too late in the evening :) [18:57:20] it's a software raid [18:57:22] well [18:57:29] restbase1010 [18:57:54] it's an HP with a hw RAID controller, but we're not using it for RAID [18:58:10] we upgraded the firmware yesterday, in the hopes that might help [18:59:22] load average and iowait is higher on that host [18:59:44] but the read/write rate is the same [18:59:55] at the application level, anyway [19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. 
[19:00:33] volans: https://grafana-admin.wikimedia.org/dashboard/db/cassandra?orgId=1&from=1512576023357&to=1512586823357&var-datasource=eqiad%20prometheus%2Fservices&var-cluster=restbase&var-keyspace=enwiki_T_mobile__ng_lead&var-table=data&var-quantile=99p&panelId=1&fullscreen [19:00:37] ok, so I can give some fresh-as-8pm-fresh-can-be eyes :-P [19:00:50] about an order of magnitude higher latency [19:00:53] heh [19:01:26] it's been like this for some time, and users aren't seeing the latency (Cassandra routes around it), so it wouldn't need to be tonight [19:01:46] (03PS1) 10MarcoAurelio: Add gwtoolset to GlobalGroupPermissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395806 [19:01:48] bah, I have things I want to deploy but I am about to head out [19:01:51] hoo: around? [19:01:58] sure [19:02:09] urandom: 1012 seems to have had some high latency earlier [19:02:13] too [19:02:16] I wanted to get https://gerrit.wikimedia.org/r/#/c/395785/ and https://gerrit.wikimedia.org/r/#/c/395786/ out [19:02:19] or is it a transient thing/ [19:02:20] ? [19:02:33] easy as pie, but I am literally walking out the door now! [19:03:03] probably only need to do the .11 one actually wikidatawiki will pick that up this evening [19:03:58] I can babysit the SWAT, that's fine w/ me [19:04:15] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [19:04:29] volans: that seems to correspond with a period of high write throughput, so probably transient [19:04:45] though the fact that it stands out might be a clue [19:04:51] herron: is that you?
^^^ (icinga) [19:05:00] the high write throughput was something all nodes saw [19:05:15] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 331 bytes in 0.036 second response time [19:05:34] (03PS4) 10Ottomata: Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [19:05:39] volans hmm let's see [19:06:05] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3817228 (10akosiaris) I can handle that. I 'll try and build it tomorrow and upload it if successful [19:06:07] it just recovered, but might good to check it [19:06:31] hoo: I dont think anyone is swatting D: are you able to? [19:06:39] volans: not sure how much it stands out though, 1014 is nearly as bad [19:06:48] otherwise I'll just try and get them done in the evening slot! [19:07:10] Sure… considering there's nothing else to SWAt [19:07:19] <3 [19:07:29] * addshore waves goodbye! [19:07:29] urandom: yeah I can see at different times various spikes in latency in different hosts, are we sure it's only 1010 that is special or could be something else in the cluster? [19:07:33] :) [19:07:48] any lazy guy to help me with https://phabricator.wikimedia.org/T182143 ? :P [19:08:14] volans it looks like it's happening when requests are load balanced to rhodium because icinga is checking puppet 3 endpoint on puppetmaster1001 while rhodium has been upgraded to puppet 4 [19:08:19] volans: assume nothing! [19:08:20] :) [19:08:25] hmm marostegui or jynus should be able to revi [19:08:40] volans: yeah, certainly over the last 24 hours, it seems...less of a standout [19:08:46] urandom: ok :) [19:08:57] 1012 and 1014 are giving it a run for its money [19:09:03] they are recent additions [19:09:08] within the last 2 days [19:09:16] herron: ok, so the check needs to be updated for puppet4? 
will you take care of it as part of the upgrade process? [19:09:24] 1007 has good times [19:09:36] yes [19:09:56] so tbh is the only one [19:10:02] volans yes indeed, it's fixed in puppet drive by the puppet_major_version setting. this check is confusing though, the alert looks like a problem on puppetmaster1001 but behind the scenes that is a load balancer [19:10:12] 1008 & 1009 aren't too bad [19:10:19] 1008-9 have slower times than 1007 but no high spikes [19:10:21] so maybe should adjust that as well [19:10:24] Hauskatze: seems both of them are unavailable right now, :P [19:10:33] I'll come back tomorrow, no hurry anyway [19:10:49] revi: legoktm maybe? [19:10:53] herron: ok, I'll leave it to you, sorry, busy with this other thing :) [19:11:11] volans ok thanks for the ping! [19:11:23] Hauskatze: PM'ed him few minutes ago, I was thinking it would be good to ask if someone else was willing to do it [19:11:36] yw :) [19:12:42] urandom: so looking at the avg in grafana, seems that 10,12,14 are more or less equally bad (I've taken the larger period I can without the initial latency for the initial sync) [19:13:02] volans: [19:13:11] volans: on that basis, bad == HP [19:13:16] less bad == Dell [19:13:17] :( [19:13:33] both ssd I guess [19:13:53] all SSD, yes [19:15:36] what is the typical workload writes/reads? 
[19:16:36] * volans trying to assess if the smart path here is helping or making it worse [19:16:44] (03PS2) 10Chad: Add gerrit.googlesource.com and gitlab.com to Phab proxy whitelist [puppet] - 10https://gerrit.wikimedia.org/r/394640 (https://phabricator.wikimedia.org/T181835) [19:16:50] (03PS5) 10Ottomata: Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [19:17:11] Anyone got a sec to review that puppet change ^^ pretty easy and would unblock a few things for me [19:17:14] thanks in advance <3 [19:17:19] volans: mostly reads: https://grafana-admin.wikimedia.org/dashboard/db/cassandra?orgId=1 [19:17:34] (03PS6) 10Ottomata: Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [19:17:45] volans: to make things interesting, that relationship is a bit different in codfw [19:17:49] (if you change the datasource) [19:18:17] hard to do apples-apples comparisons there [19:18:24] eheheh [19:19:29] (03PS3) 10Dzahn: admins: new groups for reading varnish/mw logs, debugging [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) [19:19:33] all of these machines have Samsung SSDs [19:19:47] (all with the exception of 2004 in codfw) [19:19:54] !log ppchelko@tin Started deploy [cpjobqueue/deploy@df72b34]: Deduplicate based on event sha1 as well as on id [19:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:04] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@df72b34]: Deduplicate based on event sha1 as well as on id (duration: 00m 09s) [19:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:44] (03PS4) 10Dzahn: admins: new groups for reading varnish/mw logs, debugging [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) [19:21:08] urandom: so from this few minutes of looking at it, 
if it was me and it's not too complex I would pick one of the new hosts, disable the smart path in teh raid config and see how it goes [19:21:25] there are controversial benchmarks in real life workloads of this feature [19:21:45] (03PS7) 10Ottomata: Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [19:22:06] volans: smart path? [19:22:15] LD Acceleration Method: HPE SSD Smart Path [19:22:21] !log ppchelko@tin Started deploy [cpjobqueue/deploy@8c66189]: Deduplicate based on event sha1 as well as on id [19:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:33] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@8c66189]: Deduplicate based on event sha1 as well as on id (duration: 00m 11s) [19:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:42] !log ppchelko@tin Started deploy [cpjobqueue/deploy@8c66189]: Deduplicate based on event sha1 as well as on id, take 2 [19:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:53] (03PS1) 10Cmjohnson: Adding dhcpd entriesf or db111[12] [puppet] - 10https://gerrit.wikimedia.org/r/395814 [19:23:41] !log hoo@tin Synchronized php-1.31.0-wmf.10/extensions/Wikibase/repo/maintenance/dispatchChanges.php: dispatch: track how long client selecting takes (duration: 00m 48s) [19:23:50] (03PS5) 10Dzahn: admins: new groups for reading varnish/mw logs, debugging [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) [19:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:06] volans: interesting [19:24:08] !log ppchelko@tin Started deploy [cpjobqueue/deploy@8c66189]: Deduplicate based on event sha1 as well as on id, take 2 [19:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:23] urandom: cannot promise it will help, it should actually be 
better with that on for read intensive workloads, and looking at iostat quickly I didn't see big differences in await and w_await [19:24:41] between 1007 and 1010 [19:24:46] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@8c66189]: Deduplicate based on event sha1 as well as on id, take 2 (duration: 00m 38s) [19:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:42] (03PS8) 10Ottomata: Puppetization for superset [puppet] - 10https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [19:28:12] urandom: other things worth to look at I don't have time right now are disk configs like block size, etc... [19:28:16] also one quick question [19:28:41] 1007 has ~180GB used per disk, 1010 ~260G... [19:29:06] is it normal/expected because of different weights in the cluster sharding? [19:29:10] yeah, we just bootstrapped new nodes in, and 1007 was a late addition [19:29:29] ok [19:29:34] 1010 will need a 'cleanup', a process that removes data that is no longer relevant to that node [19:29:40] unreachable data [19:29:52] so should not account for more IOPS atm, ok [19:30:03] I gotta go, dinner's ready... sorry [19:30:07] so 1007's utilization is closer to actual [19:30:08] 10Operations, 10Traffic, 10Documentation: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#3817344 (10MarcoAurelio) [19:30:15] makes sense [19:30:25] volans: thanks for the look! [19:30:32] have a nice dinner [19:30:36] I can have some more look later/tomorrow [19:30:45] thanks! [19:30:49] great; thanks! 
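[editor's note] The comparison volans and urandom do by eye above (1007 as the "good" host, 1010/1012/1014 as the slow ones) could be roughly sketched as below. This is only an illustration under stated assumptions: the function name, the `factor` threshold, and every number in `sample` are made up for the example and are not taken from the Grafana dashboards or from iostat output on these hosts. The fastest host is used as the baseline because, as noted in the discussion, several hosts are slow at once, which would skew a median.

```python
def flag_slow_hosts(latency_ms, factor=5.0):
    """Return hosts whose latency exceeds `factor` times the fastest host's.

    `latency_ms` maps host name -> a latency figure (e.g. an averaged p99).
    The baseline is the minimum rather than the median, so a cluster where
    half the nodes are slow still gets them flagged.
    """
    if not latency_ms:
        return []
    baseline = min(latency_ms.values())
    return sorted(h for h, v in latency_ms.items() if v > factor * baseline)


# Hypothetical figures for illustration only (NOT real measurements).
sample = {
    "restbase1007": 4.0,
    "restbase1008": 6.0,
    "restbase1009": 6.5,
    "restbase1010": 40.0,
    "restbase1012": 35.0,
    "restbase1014": 33.0,
}

slow = flag_slow_hosts(sample, factor=5.0)
print(slow)
```

With these invented inputs the three hosts described as "more or less equally bad" are the ones flagged; with real data the threshold would need tuning, and a host mid-bootstrap or awaiting 'cleanup' may look worse than it is.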
[19:32:01] !log hoo@tin Synchronized php-1.31.0-wmf.11/extensions/Wikibase/repo/maintenance/dispatchChanges.php: dispatch: track how long client selecting takes (duration: 00m 48s) [19:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:34] * hoo is done [19:32:41] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entriesf or db111[12] [puppet] - 10https://gerrit.wikimedia.org/r/395814 (owner: 10Cmjohnson) [19:40:09] (03PS1) 10Catrope: Enable ORES filters on simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395818 (https://phabricator.wikimedia.org/T182012) [19:41:05] (03CR) 10Dzahn: [C: 032] "thanks, being bold and boldening it" [puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) (owner: 10MarcoAurelio) [19:41:23] (03PS6) 10Dzahn: gerritbot: bolden `merged` on Phabricator comments [puppet] - 10https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) (owner: 10MarcoAurelio) [19:42:22] (03CR) 10Awight: "Heads-up that this hasn't been fully tested on the beta cluster yet. 
I haven't demonstrated that thresholds look good in Special:RecentCh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395818 (https://phabricator.wikimedia.org/T182012) (owner: 10Catrope) [19:42:24] (03PS1) 10Cmjohnson: Removing lingering dns entries of ms-fe1003 and ms-fe1004 [dns] - 10https://gerrit.wikimedia.org/r/395819 [19:43:39] (03CR) 10Catrope: [C: 04-1] "This doesn't work in beta labs yet, don't deploy until it does :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395818 (https://phabricator.wikimedia.org/T182012) (owner: 10Catrope) [19:44:29] (03CR) 10Cmjohnson: [C: 032] Removing lingering dns entries of ms-fe1003 and ms-fe1004 [dns] - 10https://gerrit.wikimedia.org/r/395819 (owner: 10Cmjohnson) [19:45:27] (03CR) 10Dzahn: "the request was approved after the scope was limited from root to the more specific sudo priv lines, that's why i do it like this here: ht" [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) (owner: 10Dzahn) [19:48:46] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3817666 (10Cmjohnson) [19:48:48] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3117417 (10Cmjohnson) 05Open>03Resolved [19:51:47] 10Operations, 10Puppet, 10GerritBot, 10User-MarcoAurelio: Bolden the "merged" word when a patch is merged - https://phabricator.wikimedia.org/T181886#3817687 (10MarcoAurelio) 05Open>03Resolved p:05Triage>03Low a:03MarcoAurelio Patch merged and gerritbot output looks as expected so closing this ta... 
[19:55:17] (03PS1) 10Chad: Gerrit 2.14.6 [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 [19:56:22] paladox: When you get a chance, would love some testing on that build ^^ [19:58:03] (03Abandoned) 10Hashar: kibana: support elasticsearch.url setting [puppet] - 10https://gerrit.wikimedia.org/r/356900 (owner: 10Hashar) [19:58:34] (03Abandoned) 10Hashar: spec: subject type is infered from dir structure [puppet] - 10https://gerrit.wikimedia.org/r/383870 (owner: 10Hashar) [20:00:04] no_justification: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:14] Oh go away jouncebot, I don't wanna [20:00:54] (03PS1) 10Chad: Group1 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395823 [20:00:56] (03CR) 10Chad: [C: 032] Group1 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395823 (owner: 10Chad) [20:01:00] (03PS4) 10Hashar: graphite: cleanup servers.* [puppet] - 10https://gerrit.wikimedia.org/r/377414 [20:04:15] (03Merged) 10jenkins-bot: Group1 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395823 (owner: 10Chad) [20:04:41] 10Operations, 10ORES, 10Scoring-platform-team: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3817763 (10awight) [20:04:44] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817761 (10awight) 05Open>03Resolved @halfak Let's declare this a win. We showed that the new cluster is capable of keeping up w... 
[20:05:11] !log demon@tin Synchronized php: symlink bump (duration: 00m 47s) [20:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:33] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3817771 (10awight) [20:05:35] 10Operations, 10ORES, 10Scoring-platform-team, 10Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3817772 (10awight) [20:05:39] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817770 (10awight) [20:05:57] 10Operations, 10ORES, 10Scoring-platform-team, 10Release-Engineering-Team (Kanban), and 2 others: Git refusing to clone some ORES submodules - https://phabricator.wikimedia.org/T181552#3793896 (10awight) [20:06:07] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team: Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3797349 (10awight) [20:07:08] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.11 [20:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:49] (03CR) 10Hashar: "Moritz and I rebooted contint1001 a few weeks ago. Seems it is preferable to have Jenkins to spawn on machine boot. 
I am not sure why I d" [puppet] - 10https://gerrit.wikimedia.org/r/392399 (owner: 10Hashar) [20:08:48] 10Operations: move human users out of UID range for system accounts - https://phabricator.wikimedia.org/T114446#1695535 (10zhuyifei1999) See also T45795 [20:10:03] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: partial rollback -- wikidata errors [20:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:44] Wikidata errors? [20:11:20] Fatal error: Call to undefined method ParserOutput::setRawText() in /srv/mediawiki/php-1.31.0-wmf.11/extensions/WikidataPageBanner/includes/WikidataPageBanner.hooks.php on line 165 [20:11:21] [Exception BadMethodCallException] (/srv/mediawiki/php-1.31.0-wmf.10/extensions/Wikibase/client/includes/Changes/AffectedPagesFinder.php:134) Call to a member function getSiteLinkChanges() on a non-object (string) [20:11:24] that? [20:11:31] oh [20:11:32] Er, ok that's two then [20:13:28] Anyway, I rolled wikidatawiki back to wmf.10 for now since the spike happened. Seems to have slowed down [20:13:47] Er, maybe not? 
[20:15:26] !log Ran "scap pull" on snapshot1001 after T177486 related tests [20:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:37] T177486: [Tracking] Wikidata entity dumpers need to cope with the immense Wikidata growth recently - https://phabricator.wikimedia.org/T177486 [20:15:50] (03CR) 10jenkins-bot: Group1 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395823 (owner: 10Chad) [20:16:22] I'm looking into this now… was just busy w/ some dump testing [20:16:28] Ty [20:19:09] (03PS3) 10Dzahn: Add gerrit.googlesource.com and gitlab.com to Phab proxy whitelist [puppet] - 10https://gerrit.wikimedia.org/r/394640 (https://phabricator.wikimedia.org/T181835) (owner: 10Chad) [20:19:44] (03PS4) 10Dzahn: phabricator: Add gerrit.googlesource.com and gitlab.com to Phab proxy whitelist [puppet] - 10https://gerrit.wikimedia.org/r/394640 (https://phabricator.wikimedia.org/T181835) (owner: 10Chad) [20:20:05] hoo: I thought it quieted down, but apparently not. 
I'm rolling all of group1 back now [20:20:44] Yeah :/ [20:20:47] (03PS1) 10Chad: Revert "Group1 to wmf.11" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395829 [20:20:49] (03CR) 10Chad: [C: 032] Revert "Group1 to wmf.11" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395829 (owner: 10Chad) [20:20:51] (03CR) 10Dzahn: [C: 032] phabricator: Add gerrit.googlesource.com and gitlab.com to Phab proxy whitelist [puppet] - 10https://gerrit.wikimedia.org/r/394640 (https://phabricator.wikimedia.org/T181835) (owner: 10Chad) [20:22:15] (03Merged) 10jenkins-bot: Revert "Group1 to wmf.11" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395829 (owner: 10Chad) [20:22:39] (03CR) 10jenkins-bot: Revert "Group1 to wmf.11" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395829 (owner: 10Chad) [20:36:24] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 2 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3817844 (10demon) [20:36:27] 10Operations, 10Diffusion, 10Gerrit, 10ORES, and 5 others: Add gitlab to proxies/whitelist for mirroring to phabricator - https://phabricator.wikimedia.org/T181835#3817842 (10demon) 05Open>03Resolved a:03demon [20:39:13] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817847 (10Halfak) One reason we might not want to provision these machines is that we won't be able to safely load test again. If w... [20:45:30] !log Add 400G to labsdb1003 /srv partition [20:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:49] (03PS1) 10Herron: puppetmaster: add proxypassmatch rules for puppet 4 url variants [puppet] - 10https://gerrit.wikimedia.org/r/395832 (https://phabricator.wikimedia.org/T177254) [20:50:24] no_justification: what went wrong with the train? 
[20:50:30] Wikidata stuffs [20:50:33] * hoo is looking [20:50:42] are there bug links? [20:50:45] Not yet [20:50:59] but good point [20:52:14] https://phabricator.wikimedia.org/T182243 [20:52:15] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817883 (10awight) Point well taken. What if we temporarily depool some of the servers for future tests? Any single ores* machine c... [20:54:26] revi: hi [20:57:21] no_justification: How bad was the "ParserOutput::setRawText" error? [20:58:02] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817916 (10Halfak) We could de-pool a whole datacenter. That would allow us to not mix traffic and run tests. That would also allow... [20:58:37] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3817917 (10awight) Let's do it. [20:59:34] legoktm: great, can you do it? [20:59:56] hoo, no_justification: https://gerrit.wikimedia.org/r/395834 [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171206T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:00:04] revi: links plz [21:00:13] https://phabricator.wikimedia.org/T182143 ? or Rename req? 
[21:00:37] revi: ok, go ahead [21:00:54] ok [21:00:56] (03PS1) 10EddieGP: alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) [21:01:34] started [21:01:45] (03CR) 10jerkins-bot: [V: 04-1] alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: 10EddieGP) [21:01:56] and hmm, new confirmation tick for '+50000 contribs [21:02:02] sounds wonderful [21:02:11] no_justification: So we have a serialization mismatch problem… we can either port back the forward compat to wmf10 or only start using the new serialization in wmf12 [21:02:15] any opinion? [21:02:16] Looks like nothing for ORES [21:02:23] re jouncebot [21:05:06] Rename will obviously take bit of time because he seem to have Synchbot userpage [21:05:30] page moves are further deferred from the rename [21:05:58] no_justification: https://gerrit.wikimedia.org/r/395837 Ok with you? Can you deploy that or shall I? [21:05:59] hoo: either choice is fine by me, whatever you think is best [21:06:29] I think using the old for now is more more conservative -> safer here [21:06:41] #cvt-sw plz :P [21:06:59] hoo: fine by me [21:08:30] DangSunM has most contribs on kowiki and wikidata, 25000 each [21:08:34] rest is just not that much :P [21:08:54] (03CR) 10EddieGP: "I'M SO DISTRACTED I'M FAILING ON A FUCKING ONE-LINE CONFIG CHANGE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) (owner: 10EddieGP) [21:09:08] (03PS2) 10EddieGP: alswiki: Set wgRestrictDisplayTitle = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395835 (https://phabricator.wikimedia.org/T182154) [21:09:08] eddiegp: calm down? [21:09:13] Ok… do you want to push the change? 
[21:10:18] legoktm: I'm fine, actually :D [21:10:48] hoo: I'm actually in the middle of making lunch if it's not too much hassle for you [21:10:55] Ok, will do [21:12:44] eddiegp: lol [21:12:48] now about be-tarask hmmmm [21:13:00] meh, just got another bunch of these errors [21:13:04] I will have to skip tonight's bed [21:13:08] Probably jobs re-trying :( [21:13:33] yeah, we're 1h past [21:15:39] If we want to prevent this, we would need to actual make wmf10 forward compatible… mh [21:16:59] Considering this, I think it might actually be nicer to port this back to wmf10 [21:17:11] otherwise these changes will cause trouble one more time (and then get lost) [21:24:35] !log arlolra@tin Started deploy [parsoid/deploy@dfcc622]: Updating Parsoid to 01c1fc3 [21:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:23] (03PS3) 10Andrew Bogott: wmcs: move standard includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/394625 (owner: 10Dzahn) [21:28:42] cvn-sw saying it's around cswikisource [21:34:20] !log arlolra@tin Finished deploy [parsoid/deploy@dfcc622]: Updating Parsoid to 01c1fc3 (duration: 09m 45s) [21:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:26] no_justification: I'm going with https://gerrit.wikimedia.org/r/395850 now [21:35:37] See my msg above and https://phabricator.wikimedia.org/T182243#3818044 [21:36:08] legoktm: in case I'm suddenly gone, consider it as I'm fell asleep [21:36:34] I hope nothing will happen but just in case... [21:36:35] hoo: Ack. Catching up now (mildly in a food coma) [21:37:14] I'm self-merging as no one else from my team w/ knowledge of this code is around [21:37:28] oooooh, self merge?! I'm telling!! [21:37:58] Manual c&p backport… what could possibly go wrong? 
[21:38:02] I tested it on my notebook [21:39:05] look what you did hoo [21:39:24] lol [21:39:39] self-merge -2 now!!!11 [21:39:44] ;) [21:40:14] today's lesson: self-merge can cause freenode netsplit, noted [21:40:21] /jk [21:41:12] legoktm: What about https://gerrit.wikimedia.org/r/395834? :/ [21:43:17] !log Updated Parsoid to 01c1fc3 (T178253, T61840, T180930, T179259) [21:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:32] T179259: Visual Editor removing whitespace from infoboxes by default - https://phabricator.wikimedia.org/T179259 [21:43:32] T180930: When a table is unclosed, parsoid adds the closing tag without respecting SOL constraints on the closing tag -- probably limited to selective serialization - https://phabricator.wikimedia.org/T180930 [21:43:32] T178253: Figure handler rejects nested tables in figure captions - https://phabricator.wikimedia.org/T178253 [21:43:32] T61840: jsdiff uses excessive time / memory on pages with a lot of lines - https://phabricator.wikimedia.org/T61840 [21:46:35] no_justification ah thanks, will test that :) [21:46:47] Im already on 2.14 so should be a easy swap. [21:47:34] 10Operations, 10Scoring-platform-team, 10Patch-For-Review, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3818137 (10awight) [21:48:07] paladox: I also compiled gitiles against 2.14.6 and uploaded to archiva. I'm not ready to push it out yet (let's let 2.14 itself land first), but it's there for testing [21:48:20] I'd *like* to swap it out, as it's wayyyyy faster and the UI will go nicer with the polygerrit future [21:48:31] ah thanks :). [21:49:20] hoo: test failures look unrelated... [21:49:28] Yeah [21:49:30] force merge? 
[21:49:59] CI for that is still using the build [21:50:21] hoo: I think so [21:51:11] (03CR) 10Andrew Bogott: [C: 031] wmcs: move standard includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/394625 (owner: 10Dzahn) [21:51:18] Copy+pasted backports, self-merges to avoid CI [21:51:22] This is getting better and better! [21:51:23] :D [21:51:46] !log hoo@tin Synchronized php-1.31.0-wmf.10/extensions/Wikibase/lib/includes/Changes/EntityChange.php: Make EntityChange truly forward compatible with compact diffs (T182243) (duration: 00m 48s) [21:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:57] T182243: Fatal in AffectedPagesFinder: Call to a member function getSiteLinkChanges() on a non-object (string) - https://phabricator.wikimedia.org/T182243 [21:52:14] no_justification: no respect :P [21:52:26] Ok, Wikibase should be fine now [21:53:58] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3818175 (10awight) 05Open>03Resolved [21:57:14] no_justification i will test now :). [21:57:53] (03PS1) 10Chad: Revert "Revert "Group1 to wmf.11"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395862 [21:57:56] (03CR) 10Chad: [C: 032] Revert "Revert "Group1 to wmf.11"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395862 (owner: 10Chad) [21:59:22] (03Merged) 10jenkins-bot: Revert "Revert "Group1 to wmf.11"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395862 (owner: 10Chad) [21:59:33] (03CR) 10jenkins-bot: Revert "Revert "Group1 to wmf.11"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395862 (owner: 10Chad) [22:00:13] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.11, again [22:00:22] Isn't legoktm's fix still pending? 
[22:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:45] yeah, someone needs to force merge it I guess [22:01:26] * no_justification sighs [22:01:42] (03PS1) 10Dzahn: site: add conf100[456] with role spare [puppet] - 10https://gerrit.wikimedia.org/r/395863 (https://phabricator.wikimedia.org/T166081) [22:02:02] merged to master & wmf.11 [22:02:34] * mutante kicks wikibugs [22:02:59] well, not really, i just want it to talk [22:03:49] !log demon@tin Synchronized php-1.31.0-wmf.11/extensions/WikidataPageBanner/includes/WikidataPageBanner.hooks.php: unbreak (duration: 00m 48s) [22:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:02] legoktm: i think wikibugs stopped talking [22:07:19] It seems happy in -dev [22:07:39] mutante: example of missed things? [22:08:40] legoktm: sorry, i wanted to say this, but it was my fault: "<+wikibugs> (PS1) Dzahn: site: add conf100[456] with role spare" [22:09:25] (03CR) 10Dzahn: [C: 032] site: add conf100[456] with role spare [puppet] - 10https://gerrit.wikimedia.org/r/395863 (https://phabricator.wikimedia.org/T166081) (owner: 10Dzahn) [22:09:26] legoktm: maybe it just needs a min to catch up? [22:09:38] :) [22:12:12] kowiki with 25000+ contribs is coming [22:12:12] :P [22:12:55] * Hauskatze prepares the popcorn [22:13:05] not for you revi, detention for you :P [22:13:17] :(((((( [22:13:32] 2 wikis to get kowiki [22:13:35] now [22:13:52] oh koi [22:14:03] and now [22:14:08] ko.wikipedia.org In progress [22:14:21] done [22:14:24] really? 
[22:14:28] wow [22:15:00] phew [22:15:06] except kowiki RC is doomed [22:15:19] heh [22:15:32] this is on the low-end of supervision-needed renames [22:15:37] (03PS1) 10Dzahn: site: fix "spare" -> "spare::system" for conf100[456] [puppet] - 10https://gerrit.wikimedia.org/r/395869 [22:15:49] lol [22:15:51] Melos uploaded a patch some time ago to make those moves invisible in recent changes [22:16:11] still waiting for review [22:16:16] rename move should be treated as a bot edit [22:16:17] :P [22:16:36] PROBLEM - puppet last run on conf1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:16:39] (03CR) 10Dzahn: [C: 032] site: fix "spare" -> "spare::system" for conf100[456] [puppet] - 10https://gerrit.wikimedia.org/r/395869 (owner: 10Dzahn) [22:17:26] hmm, renaming is going fast now [22:18:27] maybe hoo or legoktm could review https://gerrit.wikimedia.org/r/#/c/372803/ and see if it is good now [22:18:37] if they have time/wish to do so ofc [22:19:21] no_justification hmm it fails with [22:19:24] stuff like [22:19:24] rsync: link_stat "/git-fat/379af2caef8ed13a2b1039a3fc20a0a6b70639c9" (in archiva) failed: No such file or directory (2) [22:19:33] * hoo needs to go back to university stuffs now :/ [22:19:34] Ohnos :( [22:19:53] though i did a git pull [22:20:00] (i mean git fat pull) [22:20:02] https://phabricator.wikimedia.org/P6437 [22:20:30] Mehhhhhhhh [22:21:28] git fat? lol [22:21:33] i think it's just some did not manage to get uploaded to archiva (unless im doing it wrong). [22:21:36] RECOVERY - puppet last run on conf1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:21:45] Yeh, git fat saves storage as gerrit's binary is very large. 
[22:21:54] paladox: Either that, or I somehow uploaded the wrong files :\ [22:22:06] We could swap for git-lfs :p [22:22:07] hehehe [22:22:11] ah [22:22:14] heh [22:22:23] yeh, if you want no_justification :) [22:22:41] i can set it up if you want? [22:25:52] Operations, ORES, Scoring-platform-team, Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3802542 (greg) https://wikitech.wikimedia.org/wiki/Incident_documentation/20171128-ORES [22:26:20] Operations, Scoring-platform-team, Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538#3818306 (greg) [22:26:22] paladox: Actually, we can't yet...I haven't finished scap support for lfs [22:26:25] Operations, ORES, Scoring-platform-team, Wikimedia-Incident: Create an incident report for ORES overload incident 2017 - https://phabricator.wikimedia.org/T181795#3818304 (greg) Open>Resolved a:awight [22:26:44] no_justification when we git pull, it should include it anyways [22:26:53] i found that git pulling the normal way works [22:27:51] Yeah but you need the git-lfs installed with git for it to work [22:28:00] Which we don't have in production yet (no sane packaging) [22:28:07] ah i see [22:28:32] i guess, we should try to reupload it to archiva :) [22:31:24] I should've gotten it the first time, I didn't really do anything different :\ [22:31:54] Archiva has a rest API, I should write a tool that does bazel then uploads the artifacts for me [22:32:47] heh [22:34:38] Operations, ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584101 (Dzahn) I noticed this host was up and running but not in site.pp / no roles as part of decoming Ganglia from everything (T177225). It's gone from Icinga but running. I am adding it back to site....
[22:35:25] gerrit.war was uploaded correctly [22:35:34] but seems the plugins some how did not reach archiva. [22:35:45] My guess is I uploaded the wrong dang files [22:40:31] (PS1) Dzahn: various misc roles: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/395873 (https://phabricator.wikimedia.org/T177225) [22:40:51] heh [22:44:46] (PS1) Dzahn: site: add stat1003 ghost back to site [puppet] - https://gerrit.wikimedia.org/r/395874 (https://phabricator.wikimedia.org/T175150) [22:46:14] (CR) Dzahn: [C: 2] various misc roles: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/395873 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn) [22:46:44] (CR) Dzahn: [C: 2] site: add stat1003 ghost back to site [puppet] - https://gerrit.wikimedia.org/r/395874 (https://phabricator.wikimedia.org/T175150) (owner: Dzahn) [22:48:56] !log stat1003 - this host was kind of invisible (not in site, not in icinga) but still up, re-enabling puppet after re-adding it to site [22:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:13] (PS1) Dzahn: site: fix another "spare" -> "spare::system" mistake [puppet] - https://gerrit.wikimedia.org/r/395876 [22:51:43] (CR) Dzahn: [C: 2] site: fix another "spare" -> "spare::system" mistake [puppet] - https://gerrit.wikimedia.org/r/395876 (owner: Dzahn) [22:51:54] * paladox will wait for a new archiva upload :) [22:55:12] hmm 200 wikis left [22:55:25] !log stat1003 - re-enabled puppet after putting role::spare on it (T175150) [22:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:39] T175150: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150 [22:57:45] !log ppchelko@tin Started deploy [cpjobqueue/deploy@761524f]: Separate delay and totaldelay metrics T182216 [22:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:55] T182216: Separate delay and
totaldelay metrics in CP - https://phabricator.wikimedia.org/T182216 [22:58:17] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@761524f]: Separate delay and totaldelay metrics T182216 (duration: 00m 31s) [22:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:58] jouncebot: refresh [23:04:03] I refreshed my knowledge about deployments. [23:04:07] jouncebot: now [23:04:07] No deployments scheduled for the next 0 hour(s) and 55 minute(s) [23:04:32] Meh, time zones got me again. [23:05:32] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [23:05:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [23:06:02] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:06:35] aha, did i catch a special case [23:09:33] (CR) Paladox: "Also needs to be re uploaded to archiva after rebuilding this :)." (1 comment) [software/gerrit] - https://gerrit.wikimedia.org/r/395820 (owner: Chad) [23:12:32] !log ppchelko@tin Started restart [changeprop/deploy@065a06e]: (no justification provided) [23:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:42] urandom: still around? [23:18:31] ottomata: hey, just one more sanity check, can i kill the "kafkatee"-ganglia-view as well? https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=&tab=v&vn=&hide-hf=false [23:18:57] (ran into a puppet issue on oxygen on decom because that view exists) [23:19:52] I guess not, I just wanted to mention that the r/w balance on the IOPS is more around 62%r/38%w according to the single machine stats. The one in the Cassandra dashboard is misleading, I guess because reads are also served from RAM.
[23:21:51] (PS1) Dzahn: oxygen: revert removing ganglia [puppet] - https://gerrit.wikimedia.org/r/395884 (https://phabricator.wikimedia.org/T177225) [23:31:18] wikidata hmm [23:32:44] and wd is done, so it's almost done [23:34:35] (CR) Dzahn: [C: 2] oxygen: revert removing ganglia [puppet] - https://gerrit.wikimedia.org/r/395884 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn) [23:38:14] thanks legoktm for being around :D [23:41:02] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:41:10] no_justification seems it's in archiva https://archiva.wikimedia.org/#basicsearch/gerrit just maybe the wrong sha? [23:42:45] That's my guess....but weird, I should've used the same files.... [23:46:13] (PS9) Ottomata: Puppetization for superset [puppet] - https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [23:46:49] yep [23:46:50] (CR) jerkins-bot: [V: -1] Puppetization for superset [puppet] - https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) (owner: Ottomata) [23:47:30] (PS10) Ottomata: Puppetization for superset [puppet] - https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [23:48:00] (CR) jerkins-bot: [V: -1] Puppetization for superset [puppet] - https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) (owner: Ottomata) [23:48:17] (CR) Ottomata: "Luca! Review me please!
:)" [puppet] - https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) (owner: Ottomata) [23:49:04] (PS11) Ottomata: Puppetization for superset [puppet] - https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) [23:49:38] (CR) jerkins-bot: [V: -1] Puppetization for superset [puppet] - https://gerrit.wikimedia.org/r/392978 (https://phabricator.wikimedia.org/T166689) (owner: Ottomata) [23:56:23] (PS1) Dzahn: ganglia: delete views for kafkatee, hadoop, varnishkafka [puppet] - https://gerrit.wikimedia.org/r/395890 (https://phabricator.wikimedia.org/T177225)