[00:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T0000). [00:00:05] tgr, Jdlrobson, and Amir1: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:13] o/ [00:00:16] mine is not testable [00:02:47] \o [00:08:49] tgr: are you able to swat? [00:09:00] I can SWAT [00:09:06] Amir1: \o/ [00:09:20] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416741 (https://phabricator.wikimedia.org/T188182) (owner: 10Jdlrobson) [00:10:20] tgr: let me know when you're around [00:10:48] (03Merged) 10jenkins-bot: Re-enable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416741 (https://phabricator.wikimedia.org/T188182) (owner: 10Jdlrobson) [00:11:03] (03CR) 10jenkins-bot: Re-enable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416741 (https://phabricator.wikimedia.org/T188182) (owner: 10Jdlrobson) [00:12:31] jdlrobson: your patch is live in mwdebug1002 [00:12:37] Amir1: on it! [00:12:43] thanks! [00:17:37] Amir1: sorry this is taking a little longer. Not sure if you can do your patch on 1001 in meantime? [00:18:11] im not seeing the changes i hoped so need some more time to debug :/ [00:18:17] jdlrobson: take your time, mine is not testable, so that's quick. I see lots of errors in logstash though [00:18:20] for mwdebug1002 [00:18:25] oh really? [00:18:27] lemme check [00:18:34] 25 exceptions so far [00:18:51] jdlrobson: https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [00:18:58] (Do you have access?) 
[00:19:31] ACK [00:19:32] stupid me [00:19:35] yeh that patch is no good [00:19:42] format is completely wrong :/ [00:20:27] jdlrobson: I have enough time if you can make a quick follow up [00:20:40] thank you! [00:23:09] (PS1) Jdlrobson: Re-enable Wikidata descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/416872 (https://phabricator.wikimedia.org/T188182) [00:23:13] ^ Amir1 [00:23:18] will add this to deployment page [00:24:04] done [00:24:42] Thanks! [00:24:50] let's wait for the jenkins [00:25:03] (CR) Ladsgroup: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/416872 (https://phabricator.wikimedia.org/T188182) (owner: Jdlrobson) [00:26:14] (Merged) jenkins-bot: Re-enable Wikidata descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/416872 (https://phabricator.wikimedia.org/T188182) (owner: Jdlrobson) [00:27:04] Operations, DNS, Mail, Traffic: SPF for Greenhouse - https://phabricator.wikimedia.org/T189065#4029896 (tstarling) [00:27:29] jdlrobson: please test in mwdebug1002 [00:28:21] (CR) jenkins-bot: Re-enable Wikidata descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/416872 (https://phabricator.wikimedia.org/T188182) (owner: Jdlrobson) [00:29:47] Amir1 sweet! that did the trick! [00:29:53] Amir1: sync away! [00:30:24] ack! let's go [00:32:46] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: Re-enable Wikidata descriptions (T188182) (duration: 01m 16s) [00:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:03] jdlrobson: it's live! [00:33:04] T188182: Regression: Wikidata descriptions are not showing on various wikis - https://phabricator.wikimedia.org/T188182 [00:33:19] tgr: around? [00:35:21] Amir1: \o/ [00:35:23] thanks Amir1 [00:35:58] yw, thanks for releasing with releng [00:37:48] :) [00:48:26] (CR) Krinkle: "File exists now and is synced to prod." 
[puppet] - https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle) [00:48:32] (PS3) Krinkle: mediawiki: Enable auto_prepend_file on canary_appserver [puppet] - https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183) [00:48:50] greg-g: it seems phan is voting on branches of wikibase, it's failing. What should I do? [00:49:07] hashar force-merged some patches before [00:49:13] hope that's fine [00:49:51] Amir1: is phan there new? I didn't think so? [00:50:22] greg-g: I'm pretty sure phan never was passing [00:50:34] one of our teammates has been working to make it pass for months now [00:51:02] and selenium fails too :/ [00:51:19] (PS1) Zhuyifei1999: toollabs: install python{,3}-pymysql on exec_environ [puppet] - https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) [00:52:10] I'm confused on what changed [00:54:57] greg-g: this is another case https://gerrit.wikimedia.org/r/#/c/415319/ [00:55:05] in older cases we didn't have phan at all [00:55:09] https://gerrit.wikimedia.org/r/#/c/415290/ [00:56:33] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:57:22] and selenium failures are happening because the fix for browser tests got merged after the branch cut [00:57:23] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 75051 bytes in 0.305 second response time [00:57:43] I will backport the selenium fix tomorrow and then try again [00:57:50] It's already pretty late here [00:57:59] !log Evening SWAT is done [00:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:17] (CR) BryanDavis: "Making it easier to run pywikibot on the bastions is not my idea of beneficial, but I get the intent. 
This does create yet another expect" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [01:54:18] RECOVERY - Host cp5012.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 249.40 ms [01:54:27] RECOVERY - Host mr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 242.09 ms [01:54:47] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 241.92 ms [01:56:17] RECOVERY - Host cp5010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 246.28 ms [01:56:37] RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 241.69 ms [01:56:37] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.128, interfaces up: 37, down: 1, dormant: 0, excluded: 1, unused: 0 [01:58:07] RECOVERY - Host cp5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 240.21 ms [01:58:07] RECOVERY - Host cp5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 246.54 ms [01:58:07] RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 241.73 ms [01:58:07] RECOVERY - Host cp5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 252.60 ms [01:58:08] RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 254.75 ms [01:58:28] RECOVERY - Host cp5007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 248.04 ms [01:58:28] RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 243.85 ms [01:58:28] RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 250.82 ms [01:58:37] RECOVERY - Host cp5008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 243.87 ms [01:59:07] RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.80 ms [01:59:51] (03CR) 10Zhuyifei1999: "Right now virtualenvs are mandatory for k8s python webservices. 
Many python tools that are complicated enough also has their own virtualen" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [02:03:35] (03PS1) 10Krinkle: Improve load-order documentation for CommonSettings and InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416879 [02:06:37] _joe_: anomie: Regarding load order in wmf-config, I added some docs for it a while back, but those were only in the -labs files. Added it to the prod files as well ^^ [02:08:04] 10Operations, 10ops-eqsin, 10Traffic, 10netops: replace eqsin SFP-T/SFP+ - https://phabricator.wikimedia.org/T188923#4030533 (10Papaul) 05Open>03Resolved Complete [02:09:36] ACKNOWLEDGEMENT - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T185494 [02:10:19] (03CR) 10Krinkle: "Fly-by-comment: For manual stuff, I typically use a shell pod on the k8s job, which makes it fairly straight forward. E.g. 'webservice she" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [02:24:18] PROBLEM - HHVM rendering on mw2108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:25:17] RECOVERY - HHVM rendering on mw2108 is OK: HTTP OK: HTTP/1.1 200 OK - 75043 bytes in 0.376 second response time [02:27:17] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 [02:27:52] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.23) (duration: 06m 03s) [02:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:54] (03CR) 10Zhuyifei1999: "@Krinkle Yeah, that's exactly the recommended way to open a shell on k8s." 
[puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [02:53:07] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 247.72 ms [02:57:23] (03PS1) 10Ayounsi: Add mr1-eqsin <-> cr1-eqsin IPv6 [dns] - 10https://gerrit.wikimedia.org/r/416882 [03:07:27] PROBLEM - HHVM rendering on mw2147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:08:17] RECOVERY - HHVM rendering on mw2147 is OK: HTTP OK: HTTP/1.1 200 OK - 75049 bytes in 0.304 second response time [03:29:27] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 864.65 seconds [03:59:27] PROBLEM - Nginx local proxy to apache on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:00:17] RECOVERY - Nginx local proxy to apache on mw2132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.201 second response time [04:02:37] PROBLEM - SSH bast5001.mgmt on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:18] PROBLEM - SSH lvs5002.mgmt on lvs5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:47] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.00 seconds [04:15:47] PROBLEM - SSH dns5002.mgmt on dns5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:19:57] PROBLEM - SSH cp5001.mgmt on cp5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:19:57] PROBLEM - SSH cp5002.mgmt on cp5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:19:57] PROBLEM - SSH cp5003.mgmt on cp5003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:19:57] PROBLEM - SSH cp5004.mgmt on cp5004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:19:57] PROBLEM - SSH cp5005.mgmt on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:20:57] PROBLEM - SSH cp5010.mgmt on cp5010.mgmt is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds [04:22:57] PROBLEM - SSH cp5007.mgmt on cp5007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:30:08] PROBLEM - SSH cp5009.mgmt on cp5009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:17] PROBLEM - SSH cp5012.mgmt on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:33:17] PROBLEM - SSH cp5011.mgmt on cp5011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:33:27] PROBLEM - SSH cp5008.mgmt on cp5008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:38:27] PROBLEM - SSH lvs5001.mgmt on lvs5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:08] ACKNOWLEDGEMENT - SSH bast5001.mgmt on bast5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:08] ACKNOWLEDGEMENT - SSH cp5001.mgmt on cp5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:08] ACKNOWLEDGEMENT - SSH cp5002.mgmt on cp5002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:08] ACKNOWLEDGEMENT - SSH cp5003.mgmt on cp5003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:08] ACKNOWLEDGEMENT - SSH cp5004.mgmt on cp5004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:08] ACKNOWLEDGEMENT - SSH cp5005.mgmt on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:08] ACKNOWLEDGEMENT - SSH cp5007.mgmt on cp5007.mgmt is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:09] ACKNOWLEDGEMENT - SSH cp5008.mgmt on cp5008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:09] ACKNOWLEDGEMENT - SSH cp5009.mgmt on cp5009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:10] ACKNOWLEDGEMENT - SSH cp5010.mgmt on cp5010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:10] ACKNOWLEDGEMENT - SSH cp5011.mgmt on cp5011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [04:44:11] ACKNOWLEDGEMENT - SSH cp5012.mgmt on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black eqsin mgmt network issues, non-critical atm, possibly related to T188923 [05:19:50] <_joe_> Krinkle: thanks, I did chase it down across repos yesterday anyways, I'm happy I got it right :P [05:20:15] <_joe_> Krinkle: also <3 for adding those docs [05:44:27] PROBLEM - Apache HTTP on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:45:17] RECOVERY - Apache HTTP on mw2212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.119 second response time [05:51:35] 10Operations, 10Analytics, 10Traffic: Investigate and fix odd uri_host values - https://phabricator.wikimedia.org/T188804#4020291 (10Nuria) Ok, was about to say same, this is basically reproducible with a curl in which you override host header. [06:14:58] bblack: regarding uri noise, I share that it’s unsurprising and not an error for those to be seen at Varnish front end. 
But just curious about what you meant by “rejected by MediaWiki”, certainly not MW PHP? For the most part I’d expect Varnish to respond with errorpage,404,Domain not served here. For cases that somehow make it to Apache, will fall through to the default *:80 vhost, serving statically from Apache the [06:14:58] doctor/default/index.html [06:15:37] Which also Unknown domain, and is indeed an odd 200 OK but that’s expected from Apache given that it matches based on path and not Hostname [06:16:21] I’m curious in what cases a request can make it to Apache in this way though, I’ve never been able to reproduce that myself. I always get Varnish serving 404 [06:31:28] 10Operations, 10DBA, 10cloud-services-team: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4030976 (10Marostegui) @Andrew which day/time would work for you to get this done? [06:34:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416892 (https://phabricator.wikimedia.org/T187089) [06:36:27] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:36:27] PROBLEM - HHVM rendering on mw2102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416892 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:37:17] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 75057 bytes in 0.391 second response time [06:37:17] RECOVERY - HHVM rendering on mw2102 is OK: HTTP OK: HTTP/1.1 200 OK - 75057 bytes in 0.314 second response time [06:38:25] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416892 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:38:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/416892 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [06:40:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 for alter table (duration: 01m 16s) [06:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:11] (03PS3) 10Madhuvishy: dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) [06:43:07] madhuvishy: o/ [06:49:25] !log Deploy schema change on db1079 with replication enabled (this will generate lag on labs) - T187089 T185128 T153182 [06:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:40] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [06:49:40] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [06:49:40] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [06:51:12] (03PS4) 10Madhuvishy: dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) [06:56:04] (03PS5) 10Madhuvishy: dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) [06:58:57] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [06:59:47] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [07:01:43] (03CR) 10Madhuvishy: 
"https://puppet-compiler.wmflabs.org/compiler02/10299/" [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [07:02:27] PROBLEM - HHVM rendering on mw2140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:17] RECOVERY - HHVM rendering on mw2140 is OK: HTTP OK: HTTP/1.1 200 OK - 75073 bytes in 0.290 second response time [07:18:27] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [07:18:53] <_joe_> what's up with restbase in esams? [07:19:08] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [07:23:15] (03PS1) 10Marostegui: db-codfw.php: Depool db2089,db2079 and db2065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416898 [07:25:31] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2089,db2079 and db2065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416898 (owner: 10Marostegui) [07:27:09] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2089,db2079 and db2065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416898 (owner: 10Marostegui) [07:27:27] (03CR) 10Elukey: "Thanks! Added a suggestion about the cron hour value.." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416494 (https://phabricator.wikimedia.org/T188939) (owner: 10Nuria) [07:28:19] (03CR) 10jenkins-bot: db-codfw.php: Depool db2089,db2079 and db2065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416898 (owner: 10Marostegui) [07:28:47] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2089,db2079 and db2065 (duration: 01m 15s) [07:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:06] !log Stop mariadb on db2089,db2079 and db2065 for kernel upgrade [07:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:00] (03PS1) 10Madhuvishy: dumps: Fold distribution related rsync config to profile/ path [puppet] - 10https://gerrit.wikimedia.org/r/416901 (https://phabricator.wikimedia.org/T188726) [07:32:07] RECOVERY - SSH cp5012.mgmt on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:33:07] RECOVERY - SSH cp5011.mgmt on cp5011.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:33:18] RECOVERY - SSH cp5008.mgmt on cp5008.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:38:18] RECOVERY - SSH lvs5001.mgmt on lvs5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:39:38] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2089,db2079 and db2065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416903 [07:41:49] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2089,db2079 and db2065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416903 (owner: 10Marostegui) [07:43:58] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2089,db2079 and db2065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416903 (owner: 10Marostegui) [07:45:37] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2089,db2079 and db2065 after mariadb and kernel upgrade (duration: 01m 16s) [07:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:05] (03PS2) 
Madhuvishy: dumps: Fold distribution related rsync config to profile/ path [puppet] - https://gerrit.wikimedia.org/r/416901 (https://phabricator.wikimedia.org/T188726) [07:48:09] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2089,db2079 and db2065" [mediawiki-config] - https://gerrit.wikimedia.org/r/416903 (owner: Marostegui) [07:48:27] PROBLEM - HHVM rendering on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:49:17] RECOVERY - HHVM rendering on mw2114 is OK: HTTP OK: HTTP/1.1 200 OK - 75071 bytes in 0.325 second response time [07:52:41] (PS3) Madhuvishy: dumps: Fold distribution related rsync config to profile/ path [puppet] - https://gerrit.wikimedia.org/r/416901 (https://phabricator.wikimedia.org/T188726) [07:55:27] PROBLEM - HHVM rendering on mw2137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:56:18] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 75071 bytes in 0.304 second response time [07:56:48] Morning [07:56:59] Doing any planned maintenance this morning? [07:57:14] en.Wikisource.org was loading rather more slowly than usual? 
[07:57:39] pages are loading reaaaly slowly for me too [07:58:02] not an uncommon thing, seems to happen often in European mornings [07:58:07] RECOVERY - SSH cp5002.mgmt on cp5002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:07] RECOVERY - SSH lvs5002.mgmt on lvs5002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:07] RECOVERY - SSH cp5010.mgmt on cp5010.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:07] RECOVERY - SSH cp5009.mgmt on cp5009.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:07] RECOVERY - SSH cp5007.mgmt on cp5007.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:07] RECOVERY - SSH cp5001.mgmt on cp5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:07] RECOVERY - SSH bast5001.mgmt on bast5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:08] RECOVERY - SSH cp5005.mgmt on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:08] RECOVERY - SSH cp5004.mgmt on cp5004.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [07:58:09] RECOVERY - SSH cp5003.mgmt on cp5003.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [08:00:21] Nikerabbit: like https://phabricator.wikimedia.org/T189085 ? 
[08:04:15] <_joe_> Nemo_bis: let me slap the right tags on that ticket [08:04:41] Operations, Wikimedia-General-or-Unknown, Performance: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085#4031100 (Joe) p:Triage>High [08:04:43] (PS1) Madhuvishy: dumps: Move rsync ferm rules to base rsync config [puppet] - https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) [08:05:58] for the people having slowness issues, can you send the output of the troubleshooting commands listed on https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [08:06:17] Nemo_bis, Nikerabbit ^ [08:06:38] especially an mtr, or ping+traceroute, would be useful [08:06:39] if it's T189085, it does not look like a network problem [08:06:40] T189085: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085 [08:07:09] that's a 503 and it's probably issued by our infrastructure, not a network timeout [08:07:29] (CR) Madhuvishy: "https://puppet-compiler.wmflabs.org/compiler02/10303/" [puppet] - https://gerrit.wikimedia.org/r/416901 (https://phabricator.wikimedia.org/T188726) (owner: Madhuvishy) [08:07:37] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:07:38] unless the users reporting this are behind some firewalling+forward proxy (many companies do this) [08:07:38] indeed [08:07:52] <_joe_> akosiaris: no I honestly think there is something wrong with esams [08:07:58] PROBLEM - Host kafkamon1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:09] <_joe_> we had an alarm about latencies in requests to restbase from esams only [08:08:13] _joe_: yeah it's a plausible explanation [08:08:17] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:21] ah... ganeti1005 ? 
[08:08:22] <_joe_> I'm still not sure if it's network or what [08:08:24] <_joe_> yes [08:08:28] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:08:28] I wasn't doing any tests btw [08:08:35] <_joe_> can you look into ganeti? :P [08:08:41] akosaris: I'm UK based [08:08:47] RECOVERY - Host kafkamon1001 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [08:08:50] nothing to look into tbh... the usual thing I guess [08:08:55] <_joe_> XioNoX: if I'm right, the issues have been in the esams-eqiad interconnect [08:09:02] Some ISP's in the UK filter/proxy at the backbone level [08:09:07] <_joe_> akosiaris: I mean restarting the thing [08:09:09] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo [08:09:09] g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled [08:09:17] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo [08:09:17] g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled [08:09:27] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers 
logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo [08:09:27] g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled [08:09:34] It's not a major concern, as the pages will load eventually [08:09:52] ShakespeareFan00: ah the porn filter. Is that really at the backbone level ? [08:10:02] akosiaris: yes [08:10:02] I don't recall it being implemented as a proxy [08:10:14] not that I've seen it first hand, just reports of it [08:10:17] And it's not a porn filter, it's an illegal content filter [08:10:26] lol [08:10:37] RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 66%, RTA = 0.35 ms [08:10:38] ok true, but still, guess what usually gets classified as illegal [08:10:47] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [08:10:47] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [08:10:48] It's also used to filter copyright violating sites, and extremists [08:11:07] <_joe_> Nemo_bis: still seeing slowdowns yourself? 
[08:11:16] ganeti1005 is fine again btw [08:11:37] Another possibility is that the slowdown may be due to anti-virus software [08:11:51] but I can't think of anything a WMF site does that would flag it as suspect [08:11:53] <_joe_> ShakespeareFan00: you're not the only one who reported those [08:12:02] <_joe_> so there was some issue indeed [08:12:05] <_joe_> it seems gone now [08:12:07] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/416909 [08:12:09] <_joe_> at least here [08:12:19] Hmm [08:12:41] Still slow here [08:13:07] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [08:13:18] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [08:13:38] <_joe_> ShakespeareFan00: care to give me an example, btw [08:13:46] <_joe_> a url that renders slowly [08:14:27] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [08:15:05] Will do so [08:15:06] we had a 3Gbit traffic increase in esams for like 5 mins [08:15:18] that was at 7:30 UTC [08:15:20] And it's clearing :) [08:15:31] Getting reasonable load times at present... [08:15:37] RECOVERY - SSH dns5002.mgmt on dns5002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [08:15:38] But if it recurs I'll let you know [08:20:59] _joe_: From what I've seen the issues are intermittent. Sometimes for a while good without problem then it just becomes which is most noticeable when you try to load a diff page [08:21:05] Nemo_bis: yeah looks same, I also got SPDY decoding error once [08:21:25] slow* [08:21:40] <_joe_> I am asking if anyone is having issues *right now* [08:22:49] well i kind of have slowness right now [08:23:48] <_joe_> Wiki13: which pages? can give me one example? 
[08:24:09] I just had it on a block page [08:24:14] https://nl.wikipedia.org/w/index.php?title=Speciaal:Blokkeren&wpTarget=213.126.46.40%2F29&wpDisableUTEdit=1&wpReason=other&wpReason-other=Herhaald+vandalisme+vanaf+deze+school-IP-range+%28%5B%5BWikipedia%3ABlokkering_van_school%7CMeer+informatie%5D%5D%29 [08:24:31] worked for me right away [08:24:37] It took so long that it finally canceled the request because another blocked in the meanwhile [08:24:44] admin [08:25:09] loaded quickly for me too [08:25:25] yea I don't know but it does now for me too [08:25:29] why [08:26:04] it's very very intermittent :/ [08:26:17] <_joe_> Nikerabbit: you're still experiencing these problems now? [08:26:31] fyi, I'm in the EU [08:26:39] <_joe_> because I can see from monitoring we had two issues from say 6:50 UTC to 7:30 UTC [08:26:50] so maybe that helps with pinpointing a problem [08:27:12] <_joe_> Wiki13: the problem is clearly with the EU caching pop [08:27:29] <_joe_> sorry, going afk for a while, but others in the SRE team are here [08:27:32] _joe_: well, that page above took 10 seconds to load the main document, if it is the same issue and not just slow page itself [08:27:34] yea it seems that way [08:27:45] I hear others complain about slowness too [08:28:12] but as said before, issues are intermittent and don't seem to stay around for long [08:28:14] (actually that page + timestamp of next line thanks to Konsole bug... 
if that matters) [08:33:10] 10Operations, 10Wikimedia-General-or-Unknown, 10Performance: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085#4031109 (10Nemo_bis) [08:34:13] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416909 (owner: 10Marostegui) [08:35:41] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416909 (owner: 10Marostegui) [08:36:21] (03PS2) 10Alexandros Kosiaris: Allow specifying kubelet/kubeproxy username/token [puppet] - 10https://gerrit.wikimedia.org/r/392838 [08:36:23] (03PS2) 10Alexandros Kosiaris: Add kubelet_username, kubeproxy_username hieradata [puppet] - 10https://gerrit.wikimedia.org/r/392839 [08:36:25] (03PS2) 10Alexandros Kosiaris: Use kubelet/kubeproxy specific configs [puppet] - 10https://gerrit.wikimedia.org/r/392842 [08:37:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 after alter table (duration: 01m 16s) [08:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:06] we will investigate the slowdowns at T189085 , but for now they don't seem to be happening anymore [08:38:07] T189085: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085 [08:38:14] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416909 (owner: 10Marostegui) [08:40:31] !log Deploy schema change on s7 primary master db1062 - T153182 T185128 [08:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:47] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [08:40:47] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 
[08:40:48] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4031147 (10MoritzMuehlenhoff) >>! In T181121#4029021, @chasemp wrote: > Is this on 4.4 or only 4.9? We've seen this both on 4.4 and 4.9 [08:43:51] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4031151 (10akosiaris) >>! In T181121#4029021, @chasemp wrote: > >> The following reproduces successfully >> >> * Heavy IO on a DRBD backed device in a VM. > > Is this on 4.4 or only... [08:46:25] (03PS1) 10Marostegui: db2037: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/416912 (https://phabricator.wikimedia.org/T189005) [08:47:26] (03CR) 10Marostegui: [C: 032] db2037: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/416912 (https://phabricator.wikimedia.org/T189005) (owner: 10Marostegui) [09:09:51] 10Operations, 10Traffic, 10Performance: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085#4031273 (10jcrespo) This has been handled by the traffic engineers and no further problems should happen in the next hours/days (apparently one edge server was... 
[09:12:09] https://en.wikisource.org/wiki/Page:FBI_Law_Enforcement_Bulletin_54_(8).pdf/22?fromrc=1 slow load [09:14:34] (03PS3) 10Alexandros Kosiaris: Allow specifying kubelet/kubeproxy username/token [puppet] - 10https://gerrit.wikimedia.org/r/392838 [09:14:36] (03PS3) 10Alexandros Kosiaris: Add kubelet_username, kubeproxy_username hieradata [puppet] - 10https://gerrit.wikimedia.org/r/392839 [09:14:38] (03PS3) 10Alexandros Kosiaris: Use kubelet/kubeproxy specific configs [puppet] - 10https://gerrit.wikimedia.org/r/392842 [09:16:54] !log rebooting sarin for kernel security update [09:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:40] (03CR) 10Alexandros Kosiaris: [C: 032] Allow specifying kubelet/kubeproxy username/token [puppet] - 10https://gerrit.wikimedia.org/r/392838 (owner: 10Alexandros Kosiaris) [09:17:42] (03CR) 10Alexandros Kosiaris: [C: 032] Add kubelet_username, kubeproxy_username hieradata [puppet] - 10https://gerrit.wikimedia.org/r/392839 (owner: 10Alexandros Kosiaris) [09:17:45] (03CR) 10Alexandros Kosiaris: [C: 032] Use kubelet/kubeproxy specific configs [puppet] - 10https://gerrit.wikimedia.org/r/392842 (owner: 10Alexandros Kosiaris) [09:24:43] !log rearming keyholder on sarin after reboot [09:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:47] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 1903630 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:27:16] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-rus: New upstream release [debs/contenttranslation/apertium-rus] - 10https://gerrit.wikimedia.org/r/407202 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [09:27:18] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-ukr: Initial Debian packaging [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) 
[09:31:36] !log rebooting darmstadtium (docker registry) for kernel security update [09:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:26] !log rebooting etherpad1001 (etherpad.wikimedia.org) for kernel security update [09:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:06] !log upload apertium-rus_0.2.0~r82706-1+wmf1 and apertium-ukr_0.1.0~r82563-1+wmf1 on apt.wikimedia.org/jessie-wikimedia/main. T184901 [09:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:21] T184901: Add apertium-rus-ukr MT language pair - https://phabricator.wikimedia.org/T184901 [09:53:25] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-rus-ukr] - 10https://gerrit.wikimedia.org/r/408508 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [09:54:09] PROBLEM - Request latencies on acrux is CRITICAL: CRITICAL - apiserver_request_latencies is 1389237 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:54:42] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-rus-ukr: Initial Debian packaging [debs/contenttranslation/apertium-rus-ukr] - 10https://gerrit.wikimedia.org/r/408508 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [09:56:42] moritzm: if you have 5 mins I have a ticket I'd love you to look at (some odd hhvm thing happened) [09:56:42] !log rebooting tureis/roentgenium for kernel security update [09:56:55] *tries to find it* [09:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:01] addshore: yeah, saw that, will have a look later on [09:57:08] sweet :) [10:02:14] !log upload apertium-rus-ukr_0.2.0~r82706-1+wmf1 on apt.wikimedia.org/jessie-wikimedia/main. 
T184901 [10:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:29] T184901: Add apertium-rus-ukr MT language pair - https://phabricator.wikimedia.org/T184901 [10:03:23] !log rebooting pool counters in codfw for kernel security update [10:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:48] !log reboot analytics10[35,52] for kernel updates - hadoop hdfs journal nodes (didn't manage to complete the work yesterday) [10:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:43] (03PS2) 10Gehel: wdqs: replace ::base::firewall with the appropriate profile [puppet] - 10https://gerrit.wikimedia.org/r/416701 [10:05:38] !log rebooting openldap/corp servers for kernel security update [10:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:07] (03CR) 10Gehel: [C: 032] wdqs: replace ::base::firewall with the appropriate profile [puppet] - 10https://gerrit.wikimedia.org/r/416701 (owner: 10Gehel) [10:11:10] !log rebooting openldap/WMCS servers for kernel security update [10:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:55] (03PS1) 10Gehel: wdqs: enable LDF server is now configurable [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) [10:24:19] (03PS4) 10Jcrespo: mariadb-backups: Change backup format to YYYY-MM-dd--HH-mm-SS [puppet] - 10https://gerrit.wikimedia.org/r/415608 (https://phabricator.wikimedia.org/T184696) [10:24:31] (03CR) 10Gehel: "Puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler02/10307/" [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:26:23] !log rebooting boron for kernel security update [10:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:05] !log rebooting netmon1002 for kernel security 
update [10:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:28] PROBLEM - Nginx local proxy to apache on mw2203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:18] RECOVERY - Nginx local proxy to apache on mw2203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.253 second response time [10:34:34] !log rebooting netmon2001 for kernel security update [10:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:48] (03PS2) 10Gehel: wdqs: enable LDF server is now configurable [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) [10:34:50] (03PS3) 10Gehel: [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [10:35:30] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:37:04] (03PS4) 10Gehel: [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [10:40:09] PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:40:37] sorryyy [10:40:55] (03CR) 10Jcrespo: [C: 032] mariadb-backups: Change backup format to YYYY-MM-dd--HH-mm-SS [puppet] - 10https://gerrit.wikimedia.org/r/415608 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [10:46:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I like a lot the general approach, but I think we should either transform the service to check the last index in a systemd timer, or make " (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [10:48:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] 
"I'd prefer to see this added to role::mediawiki::web for now. In its current form, it will fail on the jobrunners." [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [10:48:28] (03CR) 10Volans: "I've no context on the wdqs part, just reviewed the Puppet data types, see inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:49:02] !log reboot memcached hosts in codfw for kernel security update [10:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:47] (03CR) 10Giuseppe Lavagetto: wdqs: enable LDF server is now configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:49:59] <_joe_> gehel: <3 for using data types [10:50:52] !log reboot stat100[56] for kernel upgrades [10:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:48] (03PS3) 10Gehel: wdqs: enable LDF server is now configurable [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) [10:51:50] (03PS5) 10Gehel: [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [10:51:53] (03CR) 10Volans: "> Patch Set 8: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [10:52:25] (03CR) 10Gehel: wdqs: enable LDF server is now configurable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [10:53:28] _joe_: My first language is Java, those datatypes makes me feel much more at home! 
[10:54:05] <_joe_> gehel: well remember, they're a very different thing here [10:54:13] <_joe_> they're basically a validation system :P [10:54:55] !log rearmed keyholders on netmon1002 and netmon2001 [10:54:57] akosiaris: we need to merge, https://gerrit.wikimedia.org/r/c/408986/ - when you've time. [10:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:20] !log Deploy schema change on codfw s4 master (db2051) with replication enabled (this will generate lag on codfw) - T187089 T185128 T153182 [10:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:35] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [10:55:36] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [10:55:36] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [10:56:46] _joe_: that's JSR308 in Java, so yeah, same thing [10:57:10] (03PS2) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [10:57:19] gehel: this is super nice! https://gerrit.wikimedia.org/r/#/c/416921/3/modules/wmflib/types/ipport.pp [10:57:21] <_joe_> JSR308 looks like the made-up astronomical name for a pulsar [10:57:22] <_joe_> :P [10:57:44] <_joe_> although those are usually PSR-J1234 [10:57:44] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [10:58:02] yeah, the standardization process in Java is a complete naming mess!
[10:58:24] s/standardization/specification/ [10:59:19] RECOVERY - Hadoop NodeManager on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:03:11] (03PS3) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [11:03:41] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [11:04:36] (03PS2) 10Alexandros Kosiaris: Add apertium-ukr and apertium-rus-ukr packages [puppet] - 10https://gerrit.wikimedia.org/r/408986 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [11:04:40] (03CR) 10Alexandros Kosiaris: [C: 032] Add apertium-ukr and apertium-rus-ukr packages [puppet] - 10https://gerrit.wikimedia.org/r/408986 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [11:04:45] pep8 works on my dev env, but not on beta, different versions? 
[11:09:44] nope, failure on my part on changeset [11:15:47] 10Operations, 10DBA: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (10Marostegui) [11:16:08] jynus: I've merged your change as well on puppetmasters [11:16:18] 10Operations, 10DBA, 10Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031569 (10Marostegui) p:05Triage>03Normal [11:17:27] 10Operations, 10DBA, 10Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (10Marostegui) [11:17:57] 10Operations, 10DBA, 10Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (10Marostegui) [11:18:37] thanks [11:19:15] wait, which one, I merged the last one, and the next didn't go through yet [11:19:39] oh, apparently I ran puppet agent on puppetmaster [11:28:42] (03CR) 10Volans: [C: 031] "LGTM for the Puppet part! Compiler seems happy:" [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [11:30:24] 10Operations, 10Packaging, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4031604 (10akosiaris) >>! In T180628#4028541, @mmodell wrote: > @akosiaris: What would it take to get the git-lfs package back-ported to stretch? It's written in go, howe... [11:36:18] (03PS3) 10Giuseppe Lavagetto: site.pp: reorganize MediaWiki appservers in codfw for role/row [puppet] - 10https://gerrit.wikimedia.org/r/404498 [11:40:26] 10Operations, 10MediaWiki-Configuration, 10User-Joe, 10discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#4031627 (10Joe) So some results from different runs of testing using `ab` to render https://en.wikipedia.org/wiki/Francesco_Totti on `mwdebug10... [11:41:02] akosiaris: thanks!
[11:41:05] (03PS4) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [11:41:43] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [11:41:51] (03CR) 10Jcrespo: "Adding alex because we do some bacula cleanup he will probably agree with." [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [11:45:50] (03PS5) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [11:46:24] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [11:49:00] (03PS6) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [11:49:30] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [11:50:28] PROBLEM - Apache HTTP on mw2211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:51:18] RECOVERY - Apache HTTP on mw2211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.122 second response time [11:54:52] (03PS7) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [12:01:31] 10Operations, 10MediaWiki-Configuration, 10MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), 10Patch-For-Review, 10discovery-system: Use EtcdConfig in 
production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#4031677 (10Joe) [12:01:34] 10Operations, 10MediaWiki-Configuration, 10User-Joe, 10discovery-system: Test EtcdConfig in different failure scenarios - https://phabricator.wikimedia.org/T185078#4031676 (10Joe) 05Open>03Resolved [12:02:38] PROBLEM - DPKG on labstore1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:03:36] 10Operations, 10ops-codfw, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4031683 (10Joe) let's say for now I'll just proceed decommissioning old machines, then we can just reassess the situation. [12:03:38] RECOVERY - DPKG on labstore1003 is OK: All packages OK [12:08:47] 10Operations, 10ops-codfw, 10hardware-requests, 10User-Elukey: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4031698 (10Joe) p:05Triage>03Normal [12:11:15] (03PS1) 10Urbanecm: Add gor to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/416929 [12:11:27] (03CR) 10jerkins-bot: [V: 04-1] Add gor to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/416929 (owner: 10Urbanecm) [12:11:30] (03PS2) 10Urbanecm: Add gor to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/416929 (https://phabricator.wikimedia.org/T189109) [12:11:42] (03CR) 10jerkins-bot: [V: 04-1] Add gor to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/416929 (https://phabricator.wikimedia.org/T189109) (owner: 10Urbanecm) [12:24:58] (03CR) 10Marostegui: mariadb-backups: Allow backup consolidation and recovery (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [12:30:09] PROBLEM - Request latencies on acrux is CRITICAL: CRITICAL - apiserver_request_latencies is 1788375 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:34:02] (03PS1) 10Urbanecm: Initial configuration for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416930 
[12:34:18] 10Operations, 10Packaging, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4031758 (10akosiaris) I 've just uploaded `git-lfs` to stretch-wikimedia/main. [12:34:31] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4031760 (10Marostegui) In order to replace db1020 (m2 master) and following: https://gerrit.wikimedia.org/r/#/c/399792/3/wmf-config/db-eqiad.php I woul... [12:39:27] (03PS8) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [12:40:03] (03PS9) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [12:40:33] (03PS2) 10Urbanecm: Initial configuration for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416930 (https://phabricator.wikimedia.org/T189109) [12:44:49] (03CR) 10Jcrespo: "I have made some amends based on your comment. I do not know how to do the "source" of backups- I want to avoid all references to latest/e" [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [12:46:56] (03CR) 10Marostegui: "> I have made some amends based on your comment. 
I do not know how to" [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [12:47:34] 10Operations, 10User-Joe: Create 2 VMs in codfw for mwdebug20001 and 2002 - https://phabricator.wikimedia.org/T187468#4031804 (10Joe) a:03Joe [12:48:37] (03PS2) 10ArielGlenn: dumps: Switch rsyncer profile to use host settings from hiera [puppet] - 10https://gerrit.wikimedia.org/r/416863 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [12:49:26] (03CR) 10ArielGlenn: [C: 032] dumps: Switch rsyncer profile to use host settings from hiera [puppet] - 10https://gerrit.wikimedia.org/r/416863 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [12:49:34] (03PS10) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [12:50:04] (03PS1) 10Giuseppe Lavagetto: Add mwdebug2001/2002 in row A and B respectively. [dns] - 10https://gerrit.wikimedia.org/r/416932 (https://phabricator.wikimedia.org/T187468) [13:00:04] mobrovac and Pchelolo: It is that lovely time of the day again! You are hereby commanded to deploy JobQueue Deployment Window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T1300). [13:00:05] No GERRIT patches in the queue for this window AFAICS. 
[13:01:01] yup yup [13:02:17] (03CR) 10Marostegui: [C: 031] mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [13:04:11] (03PS2) 10ArielGlenn: dumps: Move rsyncer to distribution profile path and rename [puppet] - 10https://gerrit.wikimedia.org/r/416866 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [13:05:21] (03CR) 10ArielGlenn: [C: 032] dumps: Move rsyncer to distribution profile path and rename [puppet] - 10https://gerrit.wikimedia.org/r/416866 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [13:07:32] 10Operations, 10Packaging, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4031839 (10Paladox) @akosiaris would it be easy to backport that to Jessie too? [13:09:34] (03PS1) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: pin more nss/ldap/pam packages [puppet] - 10https://gerrit.wikimedia.org/r/416934 (https://phabricator.wikimedia.org/T187193) [13:10:05] (03CR) 10jerkins-bot: [V: 04-1] toollabs: apt_pinning: pin more nss/ldap/pam packages [puppet] - 10https://gerrit.wikimedia.org/r/416934 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [13:11:02] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler02/10313/es2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [13:12:28] (03PS5) 10Mobrovac: Swith all refreshLinks jobs to Kafka. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416476 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [13:12:29] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:19] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 75049 bytes in 0.151 second response time [13:14:13] marostegui: heads up, i'm taking over /srv/mediawiki-staging on tin :) [13:15:26] mobrovac: sure - I am not deploying anything from tin today :) [13:15:28] (03CR) 10Mobrovac: [C: 032] Swith all refreshLinks jobs to Kafka. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416476 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [13:16:39] (03Merged) 10jenkins-bot: Swith all refreshLinks jobs to Kafka. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416476 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [13:18:21] (03CR) 10jenkins-bot: Swith all refreshLinks jobs to Kafka. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416476 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [13:19:30] !log ppchelko@tin Started deploy [cpjobqueue/deploy@d84286a]: Switch all refreshLinks jobs to kafka T185052 [13:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:50] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052 [13:20:08] !log rebooting install2002 for kernel security update [13:20:13] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@d84286a]: Switch all refreshLinks jobs to kafka T185052 (duration: 00m 43s) [13:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:53] (03PS6) 10ArielGlenn: dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) 
[13:21:03] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch all refreshLinks jobs to EventBus - T185052 (duration: 01m 15s) [13:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:50] (03CR) 10ArielGlenn: [C: 032] dumps: Split up rsync config to base, mirrors, and datasets [puppet] - 10https://gerrit.wikimedia.org/r/416869 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [13:22:39] !log rebooting tungsten for kernel security update [13:22:50] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Switch all refreshLinks jobs to EventBus, file #2 - T185052 (duration: 01m 15s) [13:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:59] (03PS4) 10Gehel: wdqs: enable LDF server is now configurable [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) [13:27:55] (03CR) 10Gehel: [C: 032] wdqs: enable LDF server is now configurable [puppet] - 10https://gerrit.wikimedia.org/r/416921 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [13:32:04] (03PS3) 10Muehlenhoff: Reimage mc2036 after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/415822 (https://phabricator.wikimedia.org/T188587) [13:36:28] (03PS2) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: pin more nss/ldap/pam packages [puppet] - 10https://gerrit.wikimedia.org/r/416934 (https://phabricator.wikimedia.org/T187193) [13:36:48] (03PS1) 10Odder: Update logos for Limburgish and Picardic Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416939 (https://phabricator.wikimedia.org/T189116) [13:38:46] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: apt_pinning: pin more nss/ldap/pam packages [puppet] - 10https://gerrit.wikimedia.org/r/416934 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [13:39:33] (03CR) 
10Muehlenhoff: [C: 032] Reimage mc2036 after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/415822 (https://phabricator.wikimedia.org/T188587) (owner: 10Muehlenhoff) [13:39:40] (03PS4) 10Muehlenhoff: Reimage mc2036 after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/415822 (https://phabricator.wikimedia.org/T188587) [13:39:49] (03CR) 10Muehlenhoff: [V: 032 C: 032] Reimage mc2036 after mainboard replacement [puppet] - 10https://gerrit.wikimedia.org/r/415822 (https://phabricator.wikimedia.org/T188587) (owner: 10Muehlenhoff) [13:41:49] (03PS6) 10Gehel: [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [13:45:18] (03PS4) 10ArielGlenn: dumps: Fold distribution related rsync config to profile/ path [puppet] - 10https://gerrit.wikimedia.org/r/416901 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [13:45:59] (03CR) 10ArielGlenn: [C: 032] dumps: Fold distribution related rsync config to profile/ path [puppet] - 10https://gerrit.wikimedia.org/r/416901 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [13:51:04] jouncebot: next [13:51:04] In 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T1400) [13:51:51] (03PS2) 10Odder: Update logos for Limburgish and Picardic Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416939 (https://phabricator.wikimedia.org/T189116) [13:54:25] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#3920417 (10MoritzMuehlenhoff) @Papaul : When you're back in the data centre, can you please check the serial console? I tried to reimage the host, but it failed: ``` 13:51:07 | mc2036.codfw.wmnet | Unable to r... 
[13:55:00] (03PS7) 10Gehel: [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [13:55:02] (03PS1) 10Gehel: wdqs: cleanup reference to wdqs class as default parameter of wdqs::gui [puppet] - 10https://gerrit.wikimedia.org/r/416941 [13:57:41] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416585 (https://phabricator.wikimedia.org/T188626) (owner: 10Urbanecm) [13:58:27] (03CR) 10Gehel: "Puppet compiler confirms this is a NOOP: https://puppet-compiler.wmflabs.org/compiler02/10316/" [puppet] - 10https://gerrit.wikimedia.org/r/416941 (owner: 10Gehel) [13:58:47] (03CR) 10Zfilipin: [C: 031] Load Wikibase Quality extensions using extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416739 (https://phabricator.wikimedia.org/T106104) (owner: 10Lucas Werkmeister (WMDE)) [13:58:55] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416585 (https://phabricator.wikimedia.org/T188626) (owner: 10Urbanecm) [13:59:09] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416585 (https://phabricator.wikimedia.org/T188626) (owner: 10Urbanecm) [13:59:20] (03PS8) 10Gehel: wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T1400). [14:00:04] Urbanecm, Amir1, and Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:17] I’m here [14:00:20] I can SWAT today [14:00:42] I'll deploy config rules, then you can take over Amir1, sounds good? [14:01:08] Lucas_WMDE: merging your commit, I'll let you know when it's at mwdebug1002 for testing [14:01:16] ok thanks [14:01:29] PROBLEM - HHVM rendering on mw2200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:35] yup [14:01:40] zeljkof: not testable [14:01:49] !log setting trace probability to 0.0, restbase eqiad cassandra cluster - T189057 [14:01:55] Amir1: your commits are not testable? [14:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:04] T189057: Understand (and if possible, improve) performance of new storage strategy - https://phabricator.wikimedia.org/T189057 [14:02:20] zeljkof: yes, they only affect maintaince scripts [14:02:20] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 75049 bytes in 0.292 second response time [14:02:25] that I will run later [14:02:36] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416739 (https://phabricator.wikimedia.org/T106104) (owner: 10Lucas Werkmeister (WMDE)) [14:02:59] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:416585|New throttle rule (T188626)]] (duration: 01m 18s) [14:03:07] Urbanecm: your commit is deployed ^ [14:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:14] T188626: NASA Johnson Space Center WELL ERG National Women's History Month Edit-a-thon - https://phabricator.wikimedia.org/T188626 [14:03:48] (03PS1) 10Ppchelko: Stop reading refreshLinks jobs from the Redis queue. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416942 (https://phabricator.wikimedia.org/T185052) [14:03:50] (03Merged) 10jenkins-bot: Load Wikibase Quality extensions using extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416739 (https://phabricator.wikimedia.org/T106104) (owner: 10Lucas Werkmeister (WMDE)) [14:04:07] (03CR) 10jenkins-bot: Load Wikibase Quality extensions using extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416739 (https://phabricator.wikimedia.org/T106104) (owner: 10Lucas Werkmeister (WMDE)) [14:06:02] (03PS1) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: add NFS package pinning [puppet] - 10https://gerrit.wikimedia.org/r/416943 (https://phabricator.wikimedia.org/T189018) [14:06:09] Amir1: both of your commits have failed jenkins jobs o.O [14:06:49] Lucas_WMDE: your commit is at mwdebug1002, please test and let me know if I can deploy [14:06:55] will do [14:06:56] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: apt_pinning: add NFS package pinning [puppet] - 10https://gerrit.wikimedia.org/r/416943 (https://phabricator.wikimedia.org/T189018) (owner: 10Arturo Borrero Gonzalez) [14:07:03] zeljkof: yup, phan has been failing for a very long time now, hashar deployed before by force merging [14:07:21] Amir1: that's scary :/ [14:07:36] why aren't the failing jobs fixed or disabled? [14:07:46] selenium fails for another reason, because the fix for the selenium tests (which touches the tests only) was merged just after the branch cut [14:08:01] zeljkof: extension is still working, so I assume the extension registration is working as expected [14:08:10] Lucas_WMDE: ok to deploy? 
[14:08:14] should be, yes [14:08:14] zeljkof: it has been added and become voting recently for whatever reason that is unknown to me [14:08:19] Lucas_WMDE: ok, deploying [14:08:22] thanks [14:08:29] Amir1: ah, so it's a recent thing [14:08:34] * Lucas_WMDE looks anxiously at db1109 traffic ^^ [14:08:41] but phan never passed in Wikibase [14:08:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Add mwdebug2001/2002 in row A and B respectively. [dns] - 10https://gerrit.wikimedia.org/r/416932 (https://phabricator.wikimedia.org/T187468) (owner: 10Giuseppe Lavagetto) [14:09:20] some devs are fixing errors gradually but I doubt that anything happen any time soon [14:09:44] Amir1: that sounds like a good reason to make the job non voting, or to remove it from CI [14:10:00] !log zfilipin@tin Synchronized wmf-config/: SWAT: [[gerrit:416739|Load Wikibase Quality extensions using extension registration (T106104)]] (duration: 01m 17s) [14:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:15] T106104: [Task] Convert WikibaseQuality, WikibaseQualityConstraints and WikibaseQualityExternalValidation to use extension registration - https://phabricator.wikimedia.org/T106104 [14:10:18] Lucas_WMDE: deployed, please check and thanks for deploying with #releng ;) [14:10:26] zeljkof: thank you very much :) [14:10:48] Amir1: SWAT is all yours, please remember to close the window with `!log EU SWAT finished` or something similar :) [14:11:17] yeah. I just want to know who added it, they might have reasons :D [14:11:28] addshore: did you added it? [14:11:56] added what? [14:12:05] Lucas_WMDE: kein Problem (or something like that, my German-fu is white belt) :) [14:13:30] addshore: phan checks https://gerrit.wikimedia.org/r/#/c/416744/ [14:13:49] yes, they pas on master thought right? 
[14:14:05] nope, phan never passed on master AFAIK [14:14:48] Thiemo has been fixing it for a while now [14:15:18] oh [14:15:20] hmmmmm [14:15:30] then it shouldnt be voting, but it was working... [14:15:31] hm, mwext-php70-phan-docker is listed as success on https://gerrit.wikimedia.org/r/c/416705/ [14:15:42] the non-voting job is a different one [14:16:16] hmm, I didn't know that [14:16:24] Did Thiemo fix everything? [14:17:27] Amir1: in https://gerrit.wikimedia.org/r/#/c/330280/ in MAY phan was working [14:17:33] !log reducing compression chunk length to 32kb on "wikipedia_T_page__summary".data - T189057 [14:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:48] T189057: Understand (and if possible, improve) performance of new storage strategy - https://phabricator.wikimedia.org/T189057 [14:18:08] and in a recently merged thing https://gerrit.wikimedia.org/r/#/c/416938/ phan is successful on master [14:18:10] (03CR) 10ArielGlenn: [C: 04-1] "This change removes the rsync ferm rules from ms1001. While that server is not in active service, having it rsync-capable is handy For ex" [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [14:18:24] I was under the impression that it’s always working in master, but for some reason fails on backported branches [14:18:28] you wont be able to merge anything if phan fails... 
unless you have been +2Ving things for ages on master [14:18:50] addshore: This is one thing: https://gerrit.wikimedia.org/r/#/c/403431/ [14:19:08] <_joe_> !log adding mwdebug200{1,2} to ganeti in codfw, T187468 [14:19:16] yeah [14:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:23] T187468: Create 2 VMs in codfw for mwdebug20001 and 2002 - https://phabricator.wikimedia.org/T187468 [14:19:47] addshore: See this: https://gerrit.wikimedia.org/r/#/c/415319/ [14:19:58] 10Operations, 10DNS, 10Mail, 10Traffic: DNS Change for GreenHouse - https://phabricator.wikimedia.org/T103893#4031999 (10faidon) [14:20:01] 10Operations, 10DNS, 10Mail, 10Traffic: SPF for Greenhouse - https://phabricator.wikimedia.org/T189065#4029896 (10faidon) This has been discussed in bigger requests a couple of times before (T103893, T84201) for Greenhouse specfically, plus a bunch of other times for other third-party services. The TL;DR i... [14:20:23] Amir1: indeed, but thats a branch, so maybe something odd happens then [14:20:40] 10Operations, 10vm-requests, 10Patch-For-Review, 10User-Joe: Create 2 VMs in codfw for mwdebug20001 and 2002 - https://phabricator.wikimedia.org/T187468#4032001 (10akosiaris) [14:21:57] addshore: definitely worth a phabricator ticket [14:22:05] *agrees* [14:26:47] 10Operations, 10Packaging, 10Scap: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4032006 (10mmodell) @akosiaris Awesome, thank you! 
[14:27:19] !log ladsgroup@tin Synchronized php-1.31.0-wmf.24/extensions/Wikibase/lib/includes/Sites/SiteMatrixParser.php: [[gerrit:416744|Add code of special wikis as interwiki when populating sites table (T183019)]] (duration: 01m 16s) [14:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:33] T183019: Wikibase must not insert local recentchanges entries for nonexistent local users (days: 5) - https://phabricator.wikimedia.org/T183019 [14:28:37] zeljkof: should I sync the test files too? [14:29:05] Amir1: yes, I usually sync the entire extension [14:29:53] noted [14:29:56] Thanks [14:31:01] !log ladsgroup@tin Synchronized php-1.31.0-wmf.24/extensions/Wikibase/lib/tests/phpunit/Sites/SiteMatrixParserTest.php: [[gerrit:416744|Add code of special wikis as interwiki when populating sites table, part II (T183019)]] (duration: 01m 15s) [14:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:47] !log ladsgroup@tin Synchronized php-1.31.0-wmf.23/extensions/Wikibase/lib/includes/Sites/SiteMatrixParser.php: [[gerrit:416745|Add code of special wikis as interwiki when populating sites table (T183019)]] (duration: 01m 16s) [14:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:01] T183019: Wikibase must not insert local recentchanges entries for nonexistent local users (days: 5) - https://phabricator.wikimedia.org/T183019 [14:37:35] !log ladsgroup@tin Synchronized php-1.31.0-wmf.23/extensions/Wikibase/lib/tests/phpunit/Sites/SiteMatrixParserTest.php: [[gerrit:416745|Add code of special wikis as interwiki when populating sites table, part II (T183019)]] (duration: 01m 16s) [14:37:38] !log rebooting rdb* hosts in codfw for kernel security update [14:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:59] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.03 seconds [14:38:00] (03PS1) 
10Alexandros Kosiaris: WIP: Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 [14:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:42] (03CR) 10jerkins-bot: [V: 04-1] WIP: Populate kubeconfigs on deployment server [puppet] - 10https://gerrit.wikimedia.org/r/416950 (owner: 10Alexandros Kosiaris) [14:40:50] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4032047 (10MoritzMuehlenhoff) a:05RobH>03Papaul [14:41:30] !log ppchelko@tin Started deploy [cpjobqueue/deploy@aee2eb1]: Increase refreshLinks concurrency to 150 T185052 [14:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:46] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052 [14:42:06] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@aee2eb1]: Increase refreshLinks concurrency to 150 T185052 (duration: 00m 36s) [14:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:33] !log EU SWAT is done [14:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:24] ping me if anything happens [14:52:58] Hello! What's the right place to submit an issue concerning https://github.com/wikimedia/puppet/ ? You need to update a config before May 7, otherwise your openstreetmap import will break [14:53:58] Stereo: https://phabricator.wikimedia.org/maniphest/task/edit/form/1/ (with the 'operations' tag i guess) [14:54:12] gehel: in puppet/modules/profile/files/maps/osm-initial-import, http://planet.openstreetmap.org should be updated to https://planet.openstreetmap.org [14:54:23] I'll submit an issue there. Thanks! [14:55:13] Stereo: thanks! We are rewriting that part... so we'll try to make sure to switch to HTTPS this time... 
[14:55:49] Ah, great, no need for me to open a ticket then :] [14:56:54] Stereo: provided I have that new import script by then :) [14:57:06] Actually I see that the http:// link is only in the documentation, and that's why my search found it. Your curl does use the https version correctly. [14:57:14] So no need to panic before May :) [14:57:32] the initial import isn't much of an issue, since we only run it on new servers. I'll make sure we don't have a similar issue for the updates [15:00:18] Stereo: I can confirm: we already use HTTPS for all our current servers. [15:00:37] Stereo: thanks for making me check that! Better safe than sorry... [15:00:43] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416960 (https://phabricator.wikimedia.org/T189121) [15:02:49] (03PS1) 10Gehel: wdqs: configure new servers wdqs200[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/416961 (https://phabricator.wikimedia.org/T187766) [15:04:00] (03PS4) 10Giuseppe Lavagetto: site.pp: reorganize MediaWiki appservers in codfw for role/row [puppet] - 10https://gerrit.wikimedia.org/r/404498 [15:04:02] (03PS1) 10Giuseppe Lavagetto: mwdebug: create the two VMs in codfw [puppet] - 10https://gerrit.wikimedia.org/r/416962 (https://phabricator.wikimedia.org/T187468) [15:10:36] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10310/ no real changes." [puppet] - 10https://gerrit.wikimedia.org/r/404498 (owner: 10Giuseppe Lavagetto) [15:10:47] 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4032161 (10faidon) I believe this was blocked until today on an SFP replacement (T188923). It seems that the IP of the Atlas is responding now, and we even receive an SSH banner. So I j...
[15:10:59] 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4032163 (10faidon) [15:11:30] (03CR) 10Giuseppe Lavagetto: [C: 032] mwdebug: create the two VMs in codfw [puppet] - 10https://gerrit.wikimedia.org/r/416962 (https://phabricator.wikimedia.org/T187468) (owner: 10Giuseppe Lavagetto) [15:12:08] (03PS4) 10Urbanecm: Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) [15:13:19] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for romdwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [15:13:29] (03CR) 10Reedy: [C: 032] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416960 (https://phabricator.wikimedia.org/T189121) (owner: 10Urbanecm) [15:15:21] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416960 (https://phabricator.wikimedia.org/T189121) (owner: 10Urbanecm) [15:16:17] PROBLEM - MariaDB Slave Lag: s4 on db1064 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.79 seconds [15:16:41] checking [15:17:02] !log reedy@tin Synchronized wmf-config/throttle.php: T189121 (duration: 01m 15s) [15:17:09] A query from terbium apparently [15:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:19] T189121: Temporary lift of IP cap for Wikipedia Edit-a-thon in Amman - https://phabricator.wikimedia.org/T189121 [15:18:20] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416960 (https://phabricator.wikimedia.org/T189121) (owner: 10Urbanecm) [15:20:02] !log rebooting krypton (running grafana among others) for kernel security update [15:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:17] marostegui: it is disk #2 [15:20:47] haha 
I was just running megacli :) [15:20:50] if you are still working, offline it (if it will not break the raid) [15:20:56] I can do it if not [15:20:58] yeah, let me check [15:21:03] (I am working) :) [15:21:08] ok, then all on your hands [15:21:13] coolio! [15:23:45] disk 6 is already failed [15:23:56] It is in a different span, but I am double checking [15:26:45] !log Set disk 32:2 on db1064 as offline [15:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:03] I got a page [15:28:16] no icinga-wm alert here? [15:28:28] volans: ˜/icinga-wm 16:16> PROBLEM - MariaDB Slave Lag: s4 on db1064 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.79 seconds [15:28:51] ouch... 12 minutes delay for the page [15:28:54] sorry my bad [15:29:20] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4032253 (10Marostegui) I have set to offline 32:2 due to errors. This host has now 2 failed disks. @Cmjohnson do you have some used disks somewhere? at least to replace one of them. We have now 2 spans degr... [15:34:54] 10Operations, 10ops-eqiad, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4032261 (10elukey) Added the hadoop partitions to all the nodes but 1076, waiting for Rob's green light before proceeding. I had to run puppe... [15:40:46] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4032277 (10Cmjohnson) Let me see what I have for used spare disks [15:41:44] 10Operations, 10Traffic, 10Performance: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085#4031089 (10ema) The host was close to its weekly varnish-be scheduled restart, and I've handled it by manually restarting the varnish backend instance. We used... 
[15:42:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416964 (https://phabricator.wikimedia.org/T188685) [15:42:12] (03PS1) 10Ottomata: Configure YARN memory settings based on total node memory [puppet] - 10https://gerrit.wikimedia.org/r/416965 (https://phabricator.wikimedia.org/T188294) [15:42:53] (03CR) 10jerkins-bot: [V: 04-1] Configure YARN memory settings based on total node memory [puppet] - 10https://gerrit.wikimedia.org/r/416965 (https://phabricator.wikimedia.org/T188294) (owner: 10Ottomata) [15:44:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416964 (https://phabricator.wikimedia.org/T188685) (owner: 10Marostegui) [15:44:53] 10Operations, 10Traffic, 10Performance: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085#4031089 (10BBlack) I suspect this ticket, the above-referenced T175803, and T181315 are all inter-related or possibly pointing at the same underlying issue, jus... 
[15:45:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416964 (https://phabricator.wikimedia.org/T188685) (owner: 10Marostegui) [15:46:16] (03PS2) 10Ottomata: Configure YARN memory settings based on total node memory [puppet] - 10https://gerrit.wikimedia.org/r/416965 (https://phabricator.wikimedia.org/T188294) [15:47:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064, it is not performing well with 2 failed disks - T188685 (duration: 01m 16s) [15:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:38] T188685: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685 [15:48:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416964 (https://phabricator.wikimedia.org/T188685) (owner: 10Marostegui) [15:49:31] (03PS2) 10Madhuvishy: dumps: Move rsync ferm rules to base rsync config [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) [15:53:03] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4032306 (10Marostegui) With the two servers disks failed and the server depooled it is struggling to catch up. It is slowly doing... 
[15:54:56] !log rebooting rdb* fallback hosts in eqiad for kernel security update [15:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:12] (03PS1) 10Giuseppe Lavagetto: conftool-data: add mwdebug2001-2 [puppet] - 10https://gerrit.wikimedia.org/r/416966 [15:58:14] (03PS1) 10Giuseppe Lavagetto: debug_proxy: point to the new codfw test servers [puppet] - 10https://gerrit.wikimedia.org/r/416967 (https://phabricator.wikimedia.org/T187468) [15:58:16] (03PS1) 10Giuseppe Lavagetto: codfw: decommission mw2017-2099 [puppet] - 10https://gerrit.wikimedia.org/r/416968 (https://phabricator.wikimedia.org/T187467) [15:58:45] (03CR) 10jerkins-bot: [V: 04-1] conftool-data: add mwdebug2001-2 [puppet] - 10https://gerrit.wikimedia.org/r/416966 (owner: 10Giuseppe Lavagetto) [16:00:32] (03PS1) 10Giuseppe Lavagetto: Switch to mwdebug hosts in codfw too. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416970 (https://phabricator.wikimedia.org/T187468) [16:01:01] (03PS1) 10Elukey: role::eventlogging::analytics::files: limit retention of logs in labs [puppet] - 10https://gerrit.wikimedia.org/r/416971 (https://phabricator.wikimedia.org/T171203) [16:01:29] (03PS2) 10Giuseppe Lavagetto: conftool-data: add mwdebug2001-2 [puppet] - 10https://gerrit.wikimedia.org/r/416966 (https://phabricator.wikimedia.org/T187468) [16:01:31] (03PS2) 10Giuseppe Lavagetto: debug_proxy: point to the new codfw test servers [puppet] - 10https://gerrit.wikimedia.org/r/416967 (https://phabricator.wikimedia.org/T187468) [16:01:33] (03PS2) 10Giuseppe Lavagetto: codfw: decommission mw2017-2099 [puppet] - 10https://gerrit.wikimedia.org/r/416968 (https://phabricator.wikimedia.org/T187467) [16:01:37] (03PS3) 10Madhuvishy: dumps: Move rsync ferm rules to base rsync config [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) [16:02:22] PROBLEM - mediawiki-installation DSH group on mwdebug2002 is CRITICAL: Host mwdebug2002 is not in 
mediawiki-installation dsh group [16:02:31] PROBLEM - Nginx local proxy to apache on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:03:00] (03CR) 10Ottomata: [C: 032] role::eventlogging::analytics::files: limit retention of logs in labs [puppet] - 10https://gerrit.wikimedia.org/r/416971 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [16:03:13] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: add mwdebug2001-2 [puppet] - 10https://gerrit.wikimedia.org/r/416966 (https://phabricator.wikimedia.org/T187468) (owner: 10Giuseppe Lavagetto) [16:03:15] (03CR) 10Ottomata: [C: 031] "Cool, does rotate 1000 need to be changed too?" [puppet] - 10https://gerrit.wikimedia.org/r/416971 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [16:03:58] (03PS1) 10KartikMistry: ContentTranslation: Set cookieDomain to null for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 [16:04:02] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#4032343 (10RobH) There was a bit of confusion on this. I've opened a self dispatch (SR961821650) to try to get the part sent out. However, as its the first Singapore dispatch, it will likely fail. The Net... [16:04:21] PROBLEM - Apache HTTP on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:50] (03CR) 10Ottomata: "Looks like I can't use PCC on a spare::system node, so the an70 one fails. But, the an30 one looks like a no-op , soooo good? https://pup" [puppet] - 10https://gerrit.wikimedia.org/r/416965 (https://phabricator.wikimedia.org/T188294) (owner: 10Ottomata) [16:05:16] (03PS4) 10Madhuvishy: dumps: Rename and move profile with distribution ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) [16:05:26] <_joe_> ottomata: how could you not be able to use PCC on a spare::system? [16:05:29] <_joe_> what do you mean? 
[16:05:38] <_joe_> they're normal hosts with normal puppet runs [16:06:20] * _joe_ away [16:06:26] _joe_: https://puppet-compiler.wmflabs.org/compiler03/10321/analytics1070.eqiad.wmnet/ [16:06:34] https://puppet-compiler.wmflabs.org/compiler03/10321/ [16:06:36] it didn't find the host? [16:06:40] maybe i'm doing something dumbwrong [16:06:46] (03PS5) 10Madhuvishy: dumps: Rename and move profile with distribution ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) [16:06:59] ottomata: we have to export facts to pcc for new hosts, this is why it is failing! updating pcc now [16:07:06] 2 mins and it should be ok [16:07:16] export facts to pcc? [16:07:21] OH it needs a puppet run there first? [16:07:23] on the pcc host? [16:07:27] <_joe_> elukey: let the man know what this is all about [16:07:29] <_joe_> ottomata: nope [16:07:52] <_joe_> you have to download a sanitized version of the yamls for facts from production [16:07:54] https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs [16:07:58] there you go [16:07:59] <_joe_> and upload them to the compilers [16:08:29] !log updating pcc facts for new hosts [16:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:12] PROBLEM - HHVM rendering on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:56] (03PS1) 10Dzahn: hiera/common: add maintenance_server [puppet] - 10https://gerrit.wikimedia.org/r/416975 [16:12:51] (03CR) 10Madhuvishy: "https://puppet-compiler.wmflabs.org/compiler02/10322/" [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [16:14:41] PROBLEM - Nginx local proxy to apache on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:15:46] (03CR) 10Elukey: "We could, I didn't think about it. So atm it rotates daily maximum 1000 times before deleting, or if a file is older than 30 days.
Not sur" [puppet] - 10https://gerrit.wikimedia.org/r/416971 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [16:16:03] (03CR) 10Imarlier: [C: 031] "Rockin' - and helpful!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416879 (owner: 10Krinkle) [16:16:21] PROBLEM - Check whether ferm is active by checking the default input chain on mwdebug2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:16:21] PROBLEM - nutcracker port on mwdebug2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:17:11] PROBLEM - HHVM rendering on mwdebug2002 is CRITICAL: connect to address 10.192.16.66 and port 80: Connection refused [16:17:12] RECOVERY - nutcracker port on mwdebug2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [16:17:12] RECOVERY - Check whether ferm is active by checking the default input chain on mwdebug2001 is OK: OK ferm input default policy is set [16:17:36] ottomata: there you go https://puppet-compiler.wmflabs.org/compiler02/10324/analytics1070.eqiad.wmnet/ [16:17:46] (03CR) 10Giuseppe Lavagetto: [C: 031] "> Rockin' - and helpful!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416879 (owner: 10Krinkle) [16:18:01] PROBLEM - HHVM rendering on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 80: Connection refused [16:18:12] (03CR) 10ArielGlenn: [C: 031] "Looks sane enough." 
[puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [16:19:21] PROBLEM - Apache HTTP on mwdebug2002 is CRITICAL: connect to address 10.192.16.66 and port 80: Connection refused [16:19:37] (03PS6) 10Madhuvishy: dumps: Rename and move profile with distribution ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) [16:19:49] 10Operations, 10Electron-PDFs, 10OfflineContentGenerator, 10Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#4032411 (10phuedx) [16:21:31] RECOVERY - Apache HTTP on mwdebug2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 623 bytes in 9.149 second response time [16:21:41] RECOVERY - Nginx local proxy to apache on mwdebug2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.230 second response time [16:22:01] RECOVERY - HHVM rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 75053 bytes in 2.867 second response time [16:22:14] (03CR) 10Dzahn: [C: 032] hiera/common: add maintenance_server [puppet] - 10https://gerrit.wikimedia.org/r/416975 (owner: 10Dzahn) [16:24:06] (03PS1) 10Marostegui: db1064: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/416977 (https://phabricator.wikimedia.org/T188685) [16:24:47] (03CR) 10Marostegui: [C: 032] db1064: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/416977 (https://phabricator.wikimedia.org/T188685) (owner: 10Marostegui) [16:25:19] (03CR) 10Madhuvishy: [C: 032] dumps: Rename and move profile with distribution ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [16:25:27] (03PS7) 10Madhuvishy: dumps: Rename and move profile with distribution ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) [16:25:32] (03CR) 
10Madhuvishy: [V: 032 C: 032] dumps: Rename and move profile with distribution ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/416908 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [16:26:50] (03CR) 10Elukey: [C: 032] role::eventlogging::analytics::files: limit retention of logs in labs [puppet] - 10https://gerrit.wikimedia.org/r/416971 (https://phabricator.wikimedia.org/T171203) (owner: 10Elukey) [16:26:56] (03PS2) 10Elukey: role::eventlogging::analytics::files: limit retention of logs in labs [puppet] - 10https://gerrit.wikimedia.org/r/416971 (https://phabricator.wikimedia.org/T171203) [16:27:11] PROBLEM - DPKG on mwdebug2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:28:07] RECOVERY - DPKG on mwdebug2002 is OK: All packages OK [16:29:17] RECOVERY - HHVM rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 75053 bytes in 3.065 second response time [16:29:17] RECOVERY - Apache HTTP on mwdebug2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.131 second response time [16:31:57] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, 10Research-management: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4032451 (10RobH) [16:34:01] gehel: thank you too :) [16:36:10] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, 10Research-management: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4032455 (10RobH) [16:41:46] (03PS1) 10Madhuvishy: dumps: Rename and move distribution hiera rsync config [puppet] - 10https://gerrit.wikimedia.org/r/416980 (https://phabricator.wikimedia.org/T188727) [16:42:45] (03PS2) 10Nuria: Archiving data on piwik every 8 hrs [puppet] - 10https://gerrit.wikimedia.org/r/416494 (https://phabricator.wikimedia.org/T188939) [16:42:53] (03PS1) 10Dzahn: mediawiki_maintenance: activate crons based on fqdn, not mw_primary 
[puppet] - 10https://gerrit.wikimedia.org/r/416981 [16:43:02] RECOVERY - MariaDB Slave Lag: s4 on db1064 is OK: OK slave_sql_lag Replication lag: 47.56 seconds [16:43:13] (03CR) 10Dzahn: [C: 032] "follow-up https://gerrit.wikimedia.org/r/#/c/416981/" [puppet] - 10https://gerrit.wikimedia.org/r/416975 (owner: 10Dzahn) [16:45:15] RECOVERY - Nginx local proxy to apache on mwdebug2002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.199 second response time [16:45:31] (03PS3) 10Nuria: Archiving data on piwik every 8 hrs [puppet] - 10https://gerrit.wikimedia.org/r/416494 (https://phabricator.wikimedia.org/T188939) [16:46:18] !log ppchelko@tin Started deploy [cpjobqueue/deploy@ff41710]: Increase refreshLinks concurrency to 250 T185052 [16:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:33] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052 [16:46:51] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@ff41710]: Increase refreshLinks concurrency to 250 T185052 (duration: 00m 33s) [16:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:22] (03CR) 10Dzahn: "per chat with Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/416981 (owner: 10Dzahn) [16:49:36] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/10325/" [puppet] - 10https://gerrit.wikimedia.org/r/416981 (owner: 10Dzahn) [16:49:38] (03Abandoned) 10Niedzielski: New: add chromium_render service [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski) [16:50:50] (03PS1) 10Madhuvishy: dumps: Refactor fetcher profile [puppet] - 10https://gerrit.wikimedia.org/r/416983 (https://phabricator.wikimedia.org/T188727) [16:51:26] (03PS9) 10Dzahn: mediawiki: Add explicit dependency on ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/313963 (owner: 10Muehlenhoff) [16:51:47] (03CR) 10Andrew Bogott: wikitech: use files from swift rather 
than local uploads. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: Andrew Bogott)
[16:52:40] (PS1) Madhuvishy: dumps: Enable fetcher for labstore1006|7 [puppet] - https://gerrit.wikimedia.org/r/416984 (https://phabricator.wikimedia.org/T188727)
[16:53:17] (PS1) Mark Bergsma: [WiP] Split off attributes and exceptions from bgp.py into their own modules [debs/pybal] - https://gerrit.wikimedia.org/r/416985
[16:54:56] Operations, DBA, cloud-services-team, Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4032482 (Andrew) @Marostegui, I could do it tomorrow or Friday anytime after 15:00 UTC. Next week I'm out Monday, Tuesday, Wednesday.
[16:55:36] !log cp5001: reboot for retpoline kernel updates T188092
[16:55:49] Operations, DBA, cloud-services-team, Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4032485 (Marostegui) >>! In T189005#4032482, @Andrew wrote: > @Marostegui, I could do it tomorrow or Friday anytime after 15:00 UTC. Next week I'm out Mon...
[16:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:00] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4032488 (pmiazga)
[16:58:31] (PS4) Andrew Bogott: dns labsaliaser: reload lua script whenever it's updated.
[puppet] - https://gerrit.wikimedia.org/r/416852 (https://phabricator.wikimedia.org/T188619)
[16:59:23] (PS3) Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650)
[17:01:47] (PS4) Elukey: Archiving data on piwik every 8 hrs [puppet] - https://gerrit.wikimedia.org/r/416494 (https://phabricator.wikimedia.org/T188939) (owner: Nuria)
[17:02:21] (CR) Elukey: [C: 2] Archiving data on piwik every 8 hrs [puppet] - https://gerrit.wikimedia.org/r/416494 (https://phabricator.wikimedia.org/T188939) (owner: Nuria)
[17:02:26] RECOVERY - mediawiki-installation DSH group on mwdebug2002 is OK: OK
[17:05:06] (CR) Andrew Bogott: [C: 2] dns labsaliaser: reload lua script whenever it's updated. [puppet] - https://gerrit.wikimedia.org/r/416852 (https://phabricator.wikimedia.org/T188619) (owner: Andrew Bogott)
[17:05:17] (PS5) Andrew Bogott: dns labsaliaser: reload lua script whenever it's updated.
[puppet] - https://gerrit.wikimedia.org/r/416852 (https://phabricator.wikimedia.org/T188619)
[17:07:08] (PS2) Andrew Bogott: labs dns: add some docs for labs-ip-alias-dump [puppet] - https://gerrit.wikimedia.org/r/416860 (owner: BryanDavis)
[17:07:45] (CR) Andrew Bogott: [C: 2] labs dns: add some docs for labs-ip-alias-dump [puppet] - https://gerrit.wikimedia.org/r/416860 (owner: BryanDavis)
[17:13:22] (Abandoned) Elukey: role::analytics_cluster::client: force remount of HDFS mountpoint [puppet] - https://gerrit.wikimedia.org/r/416442 (https://phabricator.wikimedia.org/T187073) (owner: Elukey)
[17:22:19] (PS1) Arturo Borrero Gonzalez: apt: apt-upgrades: avoid debconf prompts [puppet] - https://gerrit.wikimedia.org/r/416988 (https://phabricator.wikimedia.org/T189018)
[17:23:18] (CR) Arturo Borrero Gonzalez: [C: 2] apt: apt-upgrades: avoid debconf prompts [puppet] - https://gerrit.wikimedia.org/r/416988 (https://phabricator.wikimedia.org/T189018) (owner: Arturo Borrero Gonzalez)
[17:36:33] Operations, Analytics, ChangeProp, EventBus, and 4 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4032605 (Pchelolo) p:Triage>High
[17:37:00] Operations, ops-eqiad, Analytics-Cluster, Analytics-Kanban, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4032619 (RobH)
[17:37:37] Operations, Analytics-Cluster, Analytics-Kanban, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4002756 (RobH) a:RobH>elukey Analytics1076 is ready to go as well now! You can resolve or close this task as you need/want to track implementation.
[17:41:09] !log rebooting restbase-test* for kernel security update
[17:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:36] (PS1) Chico Venancio: Cloud VPS: update bastion banner [puppet] - https://gerrit.wikimedia.org/r/416990 (https://phabricator.wikimedia.org/T168480)
[17:51:47] Operations, Analytics-Cluster, Analytics-Kanban, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4032665 (elukey) Thanks Rob! Just created the hadoop partitions on 76 as well. Since those are Stretch nodes we are going to slowly put them in production...
[17:53:10] (CR) Volans: "Seems a good starting point. I think you need also common/monitoring.yaml" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) (owner: Gehel)
[17:53:36] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_systemd_state]
[17:54:00] (PS1) BryanDavis: openstack: Refactor dns-floating-ip-updater.py script [puppet] - https://gerrit.wikimedia.org/r/416991
[17:58:11] (PS9) Gehel: wdqs: configure the new internal cluster [puppet] - https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766)
[17:58:35] (PS1) Ppchelko: Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137)
[18:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T1800).
[18:00:05] Zoranzoki21, tgr, twkozlowski, and Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:02:57] (PS1) RobH: add shell user dynkm/oliver keyes [puppet] - https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945)
[18:03:33] (PS1) Elukey: role::eventlogging::analytics::files: set su directive to logrotate [puppet] - https://gerrit.wikimedia.org/r/416994 (https://phabricator.wikimedia.org/T171203)
[18:04:50] (PS8) Volans: Icinga: add sync check for MW config on etcd [puppet] - https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597)
[18:04:52] (PS9) Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597)
[18:05:17] (CR) Volans: "Replies inline." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: Volans)
[18:07:20] (CR) Nuria: "Thanks for doing this" [puppet] - https://gerrit.wikimedia.org/r/416994 (https://phabricator.wikimedia.org/T171203) (owner: Elukey)
[18:08:06] (CR) Anomie: wiki-replicas: Accommodate new comments table with rules and compatibility (3 comments) [puppet] - https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[18:08:10] (PS1) RobH: adding dynkm/oliver keyes to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/416996 (https://phabricator.wikimedia.org/T188945)
[18:08:16] (PS2) Madhuvishy: dumps: Rename and move distribution hiera rsync config [puppet] - https://gerrit.wikimedia.org/r/416980 (https://phabricator.wikimedia.org/T188727)
[18:08:34] anyone doing swat?
[18:09:46] Operations, Ops-Access-Requests, Research, Research-collaborations, and 2 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4032767 (RobH)
[18:10:37] I can do it I guess
[18:10:44] (PS2) Elukey: role::eventlogging::analytics::files: set su directive to logrotate [puppet] - https://gerrit.wikimedia.org/r/416994 (https://phabricator.wikimedia.org/T171203)
[18:12:46] (PS3) Giuseppe Lavagetto: debug_proxy: point to the new codfw test servers [puppet] - https://gerrit.wikimedia.org/r/416967 (https://phabricator.wikimedia.org/T187468)
[18:13:00] Operations, ops-eqiad, DBA, Patch-For-Review: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4032774 (Cmjohnson) @Marostegui I swapped both disks with used disks we had from decommissioned servers. The disks are currently rebuilding. Please resolve this task once it's complet...
[18:13:02] (CR) Gergő Tisza: [C: 2] Enable loginOnly mode for local auth provider on group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/416630 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[18:13:08] (CR) Gergő Tisza: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/416630 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[18:15:01] Operations, ops-eqiad, Discovery, Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4032782 (Cmjohnson) The dell technician rescheduled for Thursday 8 March. Tentatively scheduled for 1800UTC.
[18:15:04] (CR) Volans: "Compiler results available at:" [puppet] - https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: Volans)
[18:15:47] (CR) Giuseppe Lavagetto: [C: 2] debug_proxy: point to the new codfw test servers [puppet] - https://gerrit.wikimedia.org/r/416967 (https://phabricator.wikimedia.org/T187468) (owner: Giuseppe Lavagetto)
[18:17:35] (CR) Andrew Bogott: [C: 1] Use hiera3 role/nuyaml backends on >= stretch [puppet] - https://gerrit.wikimedia.org/r/416850 (https://phabricator.wikimedia.org/T188623) (owner: Herron)
[18:18:33] (PS2) Gergő Tisza: Enable loginOnly mode for local auth provider on group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/416630 (https://phabricator.wikimedia.org/T57420)
[18:19:17] (CR) Herron: [C: 2] Use hiera3 role/nuyaml backends on >= stretch [puppet] - https://gerrit.wikimedia.org/r/416850 (https://phabricator.wikimedia.org/T188623) (owner: Herron)
[18:19:24] (PS4) Herron: Use hiera3 role/nuyaml backends on >= stretch [puppet] - https://gerrit.wikimedia.org/r/416850 (https://phabricator.wikimedia.org/T188623)
[18:19:53] (CR) Elukey: "Current pcc: https://puppet-compiler.wmflabs.org/compiler02/10329/eventlog1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/416994 (https://phabricator.wikimedia.org/T171203) (owner: Elukey)
[18:20:17] PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[18:21:00] (CR) Gergő Tisza: [C: 2] Enable loginOnly mode for local auth provider on group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/416630 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[18:21:11] (PS3) Elukey: role::eventlogging::analytics::files: set su directive to logrotate [puppet] - https://gerrit.wikimedia.org/r/416994 (https://phabricator.wikimedia.org/T171203)
[18:21:17] RECOVERY - HHVM jobrunner
on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[18:22:02] (CR) Elukey: [C: 2] role::eventlogging::analytics::files: set su directive to logrotate [puppet] - https://gerrit.wikimedia.org/r/416994 (https://phabricator.wikimedia.org/T171203) (owner: Elukey)
[18:22:04] (CR) Muehlenhoff: "I spoke to Mukunda on IRC and while PHP 7.2 is only three months old, we'll go with it. I'll update the patches tomorrow." [puppet] - https://gerrit.wikimedia.org/r/415856 (owner: Muehlenhoff)
[18:22:12] (Merged) jenkins-bot: Enable loginOnly mode for local auth provider on group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/416630 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[18:22:26] (CR) Giuseppe Lavagetto: [C: 2] Switch to mwdebug hosts in codfw too. [mediawiki-config] - https://gerrit.wikimedia.org/r/416970 (https://phabricator.wikimedia.org/T187468) (owner: Giuseppe Lavagetto)
[18:22:28] (CR) jenkins-bot: Enable loginOnly mode for local auth provider on group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/416630 (https://phabricator.wikimedia.org/T57420) (owner: Gergő Tisza)
[18:23:36] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[18:23:37] (Merged) jenkins-bot: Switch to mwdebug hosts in codfw too.
[mediawiki-config] - https://gerrit.wikimedia.org/r/416970 (https://phabricator.wikimedia.org/T187468) (owner: Giuseppe Lavagetto)
[18:23:51] tgr: I'm here, thanks for the update to the task, I almost forgot ;)
[18:24:05] (PS3) Ottomata: Configure YARN memory settings based on total node memory [puppet] - https://gerrit.wikimedia.org/r/416965 (https://phabricator.wikimedia.org/T188294)
[18:24:16] (CR) Ottomata: [V: 2 C: 2] Configure YARN memory settings based on total node memory [puppet] - https://gerrit.wikimedia.org/r/416965 (https://phabricator.wikimedia.org/T188294) (owner: Ottomata)
[18:24:19] (PS3) Gergő Tisza: Update logos for Limburgish and Picardic Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/416939 (https://phabricator.wikimedia.org/T189116) (owner: Odder)
[18:25:34] _joe_: are you ok with enabling translation cache on canary_appservers? (https://gerrit.wikimedia.org/r/#/c/414876/) ? i see that you did it on deployment-prep last August https://phabricator.wikimedia.org/T103886#3489735
[18:25:47] <_joe_> mutante: +1
[18:25:53] (CR) Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility (1 comment) [puppet] - https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[18:26:11] thanks!, 'k
[18:26:35] (CR) Andrew Bogott: [C: -1] "I think this would be easier to review/test if you break out the novaenv.sh/yaml refactor into a separate earlier patch."
[puppet] - https://gerrit.wikimedia.org/r/416991 (owner: BryanDavis)
[18:26:37] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: T57420 Enable loginOnly mode for local auth provider on group 1 (duration: 01m 20s)
[18:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:58] (CR) Gergő Tisza: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/416939 (https://phabricator.wikimedia.org/T189116) (owner: Odder)
[18:26:58] T57420: Remove local wiki password hash when CentralAuth has attached account - https://phabricator.wikimedia.org/T57420
[18:26:58] <_joe_> tgr: I've sneaked a patch of mine after yours
[18:26:58] (PS3) Dzahn: Enable reusable TC on HHVM on canary appservers [puppet] - https://gerrit.wikimedia.org/r/414876 (https://phabricator.wikimedia.org/T103886) (owner: Chad)
[18:27:13] (CR) jenkins-bot: Switch to mwdebug hosts in codfw too. [mediawiki-config] - https://gerrit.wikimedia.org/r/416970 (https://phabricator.wikimedia.org/T187468) (owner: Giuseppe Lavagetto)
[18:27:19] <_joe_> is that ok to sync? I didn't realize we're in the middle of swat
[18:27:40] <_joe_> I somehow was sure it was 1 hour ago, sorry :(
[18:28:04] _joe_: sure, want me to sync it?
[18:28:08] (Merged) jenkins-bot: Update logos for Limburgish and Picardic Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/416939 (https://phabricator.wikimedia.org/T189116) (owner: Odder)
[18:28:09] <_joe_> thanks
[18:28:15] <_joe_> and sorry for the inconvenience
[18:28:24] <_joe_> I realized it once I was on tin :P
[18:28:30] <_joe_> I already updated the file there
[18:28:42] PROBLEM - DPKG on analytics1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[18:29:03] new node --^
[18:29:27] yuppers
[18:29:42] RECOVERY - DPKG on analytics1070 is OK: All packages OK
[18:29:47] need to build some python packages that i guess are only on workers, on it
[18:29:51] (PS4) Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650)
[18:30:11] !log tgr@tin Synchronized debug.json: T187468 Switch to mwdebug hosts in codfw too (duration: 01m 15s)
[18:30:24] no prob
[18:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:26] T187468: Create 2 VMs in codfw for mwdebug20001 and 2002 - https://phabricator.wikimedia.org/T187468
[18:31:42] PROBLEM - puppet last run on analytics1070 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 2 minutes ago with 5 failures. Failed resources (up to 3 shown): Package[python3-mmh3],File[/etc/hadoop/prometheus_hdfs_datanode_jmx_exporter.yaml],File[/etc/hadoop/prometheus_yarn_nodemanager_jmx_exporter.yaml],Package[hadoop-client]
[18:32:22] twkozlowski: it seems like the same file for both wikis, is that OK?
[18:32:29] (CR) jenkins-bot: Update logos for Limburgish and Picardic Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/416939 (https://phabricator.wikimedia.org/T189116) (owner: Odder)
[18:34:50] tgr: Same file? Meaning I uploaded the same logo for both projects?
[18:35:03] same content
[18:35:18] the language seems similar so that might be correct, just checking
[18:35:28] tgr: Hm, I just checked the commit again and the logos are different
[18:35:51] https://gerrit.wikimedia.org/r/cat/416939%2C3%2Cstatic/images/project-logos/liwiki-2x.png%5E0
[18:36:03] vs https://gerrit.wikimedia.org/r/cat/416939%2C3%2Cstatic/images/project-logos/pcdwiki-2x.png%5E0
[18:37:36] Reedy: hey, you made a patch in JADE extension. Can you comment and approve re-licensing it to GPL v3+? https://gerrit.wikimedia.org/r/#/c/416004/
[18:39:02] PROBLEM - Disk space on Hadoop worker on analytics1070 is CRITICAL: NRPE: Command check_disk_space_hadoop_worker not defined
[18:39:03] twkozlowski: well, it's visually the same... but as long as you are OK with that
[18:40:08] tgr: Yup, looks OK
[18:40:29] !log tgr@tin Synchronized static/images/project-logos: T189116 Update logos for Limburgish and Picardic Wikipedias (duration: 01m 17s)
[18:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:50] T189116: Update logo for Limburgish and Picardic Wikipedias - https://phabricator.wikimedia.org/T189116
[18:42:08] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: T189116 Update logos for Limburgish and Picardic Wikipedias (duration: 01m 16s)
[18:42:21] twkozlowski: should be live
[18:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:34] 416708 has -2 and 416960 is already merged so that's all
[18:48:37] (PS3) Madhuvishy: dumps: Rename and move distribution hiera rsync config [puppet] - https://gerrit.wikimedia.org/r/416980 (https://phabricator.wikimedia.org/T188727)
[18:49:41] tgr: Hm, I'm still seeing the old versions of the logos there
[18:49:51] (CR) Dzahn: [C: 2] Enable reusable TC on HHVM on canary appservers [puppet] - https://gerrit.wikimedia.org/r/414876 (https://phabricator.wikimedia.org/T103886) (owner: Chad)
[18:50:56] (PS1) Ottomata: Don't require
jessie backport version of pyton sklearn in stretch [puppet] - https://gerrit.wikimedia.org/r/417004 (https://phabricator.wikimedia.org/T188294)
[18:51:43] (CR) Ottomata: [C: 2] Don't require jessie backport version of pyton sklearn in stretch [puppet] - https://gerrit.wikimedia.org/r/417004 (https://phabricator.wikimedia.org/T188294) (owner: Ottomata)
[18:52:02] RECOVERY - Disk space on Hadoop worker on analytics1070 is OK: DISK OK
[18:52:22] (CR) Dzahn: [C: 2] "applied on mwdebug1001 - /etc/hhvm/server.ini]/content: content changed - Scheduling refresh of Service[hhvm]" [puppet] - https://gerrit.wikimedia.org/r/414876 (https://phabricator.wikimedia.org/T103886) (owner: Chad)
[18:52:28] twkozlowski: I see the new logo for pcd, not li though
[18:52:44] weird, I tested it on mwdebug and it worked there
[18:52:58] (CR) Madhuvishy: "https://puppet-compiler.wmflabs.org/compiler02/10332/" [puppet] - https://gerrit.wikimedia.org/r/416980 (https://phabricator.wikimedia.org/T188727) (owner: Madhuvishy)
[18:53:38] I'll try re-syncing InitialiseSettings.php
[18:54:35] Operations, Deployments, HHVM, Patch-For-Review, and 2 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1401646 (Dzahn) applied on mwdebug1001 via puppet, other canary appservers to follow
[18:54:50] tgr: I can see the 1.5x and 2x versions for liwiki, so it's fine, I guess it's just a cache issue
[18:55:23] doesn't look like it
[18:55:47] some kind of scap hiccup maybe
[18:56:42] RECOVERY - puppet last run on analytics1070 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:56:53] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: retry (duration: 01m 15s)
[18:56:59] (PS1) Ottomata: Add analytics1070-1077 rack info to hadoop net-topology [puppet] - https://gerrit.wikimedia.org/r/417005 (https://phabricator.wikimedia.org/T188294)
[18:57:07] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:10] (PS2) Ottomata: Add analytics1070-1077 rack info to hadoop net-topology [puppet] - https://gerrit.wikimedia.org/r/417005 (https://phabricator.wikimedia.org/T188294)
[18:58:16] (CR) Ottomata: [C: 2] Add analytics1070-1077 rack info to hadoop net-topology [puppet] - https://gerrit.wikimedia.org/r/417005 (https://phabricator.wikimedia.org/T188294) (owner: Ottomata)
[18:58:19] (CR) Ottomata: [V: 2 C: 2] Add analytics1070-1077 rack info to hadoop net-topology [puppet] - https://gerrit.wikimedia.org/r/417005 (https://phabricator.wikimedia.org/T188294) (owner: Ottomata)
[19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T1900)
[19:00:05] No GERRIT patches in the queue for this window AFAICS.
[19:01:44] twkozlowski: I can confirm that the code has been deployed on the appserver which is showing the old logo
[19:01:54] (PS2) Rush: Cloud VPS: update bastion banner [puppet] - https://gerrit.wikimedia.org/r/416990 (https://phabricator.wikimedia.org/T168480) (owner: Chico Venancio)
[19:02:14] also, other servers show the new logo (mwdebug1002 for example)
[19:02:48] (CR) Rush: [C: 2] Cloud VPS: update bastion banner [puppet] - https://gerrit.wikimedia.org/r/416990 (https://phabricator.wikimedia.org/T168480) (owner: Chico Venancio)
[19:03:13] Operations, Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4033006 (Ottomata) analytics1070 is in Hadoop ready for biznaass
[19:04:29] _joe_: does https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Available_backends need an update?
[19:06:33] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[19:06:33] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:06:42] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:06:42] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:06:52] PROBLEM - puppet last run on db1108 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:07:13] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:07:22] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:07:23] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:08:03] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:08:16] tgr: Yup, I can see it now
[19:08:42] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:08:52] PROBLEM - puppet last run on ores1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:09:12] PROBLEM - puppet last run on mc1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:09:17] I have no idea what happened there
[19:09:32] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:09:32] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[19:09:33] the Varnish purge requests failed, maybe?
[19:09:46] even though I tried a bunch of manual purges
[19:10:10] tgr: I did https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug and it worked straight away
[19:10:24] So looks like a cache delay, we've had that before with logos
[19:10:43] twkozlowski: still doesn't work without it though
[19:11:06] it's a HTML level issue, the skin is pointing at the wrong URL
[19:11:33] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:11:33] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:12:05] trying manual puppet runs on a handful of these hosts worked, so these should clear on their own. going to try kicking off runs with cumin
[19:12:11] and it's definitely not varnish, I can go to some non-cached special page and still get the old logo
[19:12:36] I guess hhvm needs to be restarted on that host to pick up the change?
[19:12:42] (PS1) Ottomata: Revert "Revert "Point Mediawiki Monolog at Kafka jumbo in production"" [mediawiki-config] - https://gerrit.wikimedia.org/r/417006
[19:12:43] I see the new one everywhere on both li and pdc Wikipedias
[19:12:55] twkozlowski: without X-Wikimedia-Debug?
[19:12:58] also, pcd
[19:12:59] thcipriani: ok if I do https://gerrit.wikimedia.org/r/#/c/417006/ now?
[19:13:06] tgr: Like, in my browser
[19:13:32] well, you can have X-Wikimedia-Debug in your browser :)
[19:13:39] (PS2) Ottomata: Revert "Revert "Point Mediawiki Monolog at Kafka jumbo in production"" [mediawiki-config] - https://gerrit.wikimedia.org/r/417006 (https://phabricator.wikimedia.org/T188136)
[19:13:57] https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb
[19:14:12] ottomata: I think tgr may be deploying right now?
[19:14:31] thcipriani: just debugging, I'm done with the deploy
[19:14:39] tgr: Although, hm, now that I tried private mode I do see the old one
[19:14:48] the code was synced successfully, it's just not being picked up everywhere
[19:15:37] thcipriani: can you restart hhvm on appservers? I think normal deployers don't have the right for that
[19:16:07] tgr: I don't have rights to do that either afaik, what's happening?
[19:16:30] tgr: , ok can you let me know when deploy is done :)
[19:16:34] ?
[19:16:37] a config change is not picked up even though the code has updated
[19:16:42] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:16:52] RECOVERY - puppet last run on db1108 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:17:09] oh, which change?
[19:17:09] ottomata: I'm done, debugging should not interfere with the next deploy
[19:17:13] (CR) Chad: [C: 2] Improve load-order documentation for CommonSettings and InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/416879 (owner: Krinkle)
[19:17:13] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:17:16] "MediaWiki is in read-only mode for maintenance. Please try again in a few minutes."
[19:17:34] thcipriani: https://gerrit.wikimedia.org/r/#/c/416939/3/wmf-config/InitialiseSettings.php
[19:17:40] likely be fixed by touching InitialiseSettings.php and resyncing
[19:17:44] it's not really a problem, just weird
[19:17:52] tried that, didn't work
[19:17:59] oh, weird.
[19:18:03] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:18:26] !log ladsgroup@terbium:~$ mwscript extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=mediawikiwiki --force-protocol https (T183019)
[19:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:41] T183019: Wikibase must not insert local recentchanges entries for nonexistent local users (days: 5) - https://phabricator.wikimedia.org/T183019
[19:18:42] RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[19:18:44] it would be pretty nasty if it happened for some more significant config change though
[19:18:52] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:19:12] RECOVERY - puppet last run on mc1026 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:19:21] tgr: for what it's worth, visiting https://li.wikipedia.org/wiki/Veurblaad shows me liwiki-2x.png, as expected.
[19:19:32] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:19:32] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[19:20:47] What do you see at https://pcd.wikipedia.org/wiki/Accueul Krinkle?
[19:21:03] <_joe_> indeed it needs it
[19:21:13] twkozlowski: I see it loading https://pcd.wikipedia.org/static/images/project-logos/pcdwiki-2x.png, which to me looks like an old logo.
[19:21:24] But https://pcd.wikipedia.org/static/images/project-logos/pcdwiki-2x.png?blabla shows the new thing
[19:21:27] This is not MediaWiki caching
[19:21:29] this is Varnish
[19:21:32] PROBLEM - HHVM rendering on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:21:39] updating files in /static/ requires running purge
[19:21:44] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:21:47] See the Deploy manual on wikitech.
[19:22:12] Only files within $IP (/w) are auto-hashed by ResourceLoader
[19:22:22] RECOVERY - HHVM rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 75127 bytes in 0.326 second response time
[19:22:22] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:22:23] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:22:34] tgr: max-age is 1 year I think, so this'll need a Varnish purge
[19:23:10] I can see the old logo when I edit the page, that can't be Varnish
[19:23:23] and the file URL is different
[19:23:34] or does that come from CSS?
[19:23:57] tgr: Yes, it's Varnish. MediaWiki just outputs a url.
[19:24:26] oh, OK, I thought the file URL is in the HTML page somewhere
[19:24:32] Urls like /static/ are outside /w/ which means logged-in cookies are not considered.
[19:24:49] tgr: It can be in the HTML page, but still wouldn't matter as it isn't versioned.
[19:24:57] no ?123 or anything
[19:26:12] ah, OK, I get it
[19:26:30] didn't realize X-Wikimedia-Debug worked for static files as well
[19:26:38] it's pretty obvious in hindsight
[19:27:45] Krinkle: So for future reference, this is https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge ?
[19:28:25] tgr: Ah, yeah, logged-in is ignored for /w/load.php and /static, but XWD does bypass Varnish for both of those, given they are served from Apache (just not from PHP in case of /static/) [19:28:30] Easy to miss indeed [19:28:59] OK, purged, looks correct now [19:29:10] twkozlowski: yes [19:29:30] (03PS1) 10ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) [19:29:40] oh, I see it's at https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Image_Cache_Purges as well [19:29:47] (03CR) 10jerkins-bot: [V: 04-1] cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) (owner: 10ArielGlenn) [19:30:42] (03PS2) 10ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) [19:30:50] ha, probably should have read that [19:32:07] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4033104 (10Krinkle) [19:32:10] (03PS2) 10Herron: add puppetdb role to puppetdb[12]001 servers [puppet] - 10https://gerrit.wikimedia.org/r/409995 (https://phabricator.wikimedia.org/T185499) [19:32:29] (03CR) 10jerkins-bot: [V: 04-1] add puppetdb role to puppetdb[12]001 servers [puppet] - 10https://gerrit.wikimedia.org/r/409995 (https://phabricator.wikimedia.org/T185499) (owner: 10Herron) [19:33:07] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3786217 (10Krinkle) [19:33:34] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 
0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3786217 (10Krinkle) [19:35:32] !log ladsgroup@terbium:~$ mwscript extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https on fawiki and hewiki (T183019) [19:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:49] T183019: Wikibase must not insert local recentchanges entries for nonexistent local users (days: 5) - https://phabricator.wikimedia.org/T183019 [19:36:26] (03PS1) 10Ottomata: Set up notebook100[34] with new swap role and jupyterhub profile [puppet] - 10https://gerrit.wikimedia.org/r/417010 (https://phabricator.wikimedia.org/T183935) [19:37:08] (03CR) 10jerkins-bot: [V: 04-1] Set up notebook100[34] with new swap role and jupyterhub profile [puppet] - 10https://gerrit.wikimedia.org/r/417010 (https://phabricator.wikimedia.org/T183935) (owner: 10Ottomata) [19:37:40] (03CR) 10Krinkle: cheap image dump script that might be ok for wikitech (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) (owner: 10ArielGlenn) [19:38:12] (03PS2) 10Ottomata: Set up notebook100[34] with new swap role and jupyterhub profile [puppet] - 10https://gerrit.wikimedia.org/r/417010 (https://phabricator.wikimedia.org/T183935) [19:38:23] (03PS3) 10Herron: add puppetdb role to puppetdb[12]001 servers [puppet] - 10https://gerrit.wikimedia.org/r/409995 (https://phabricator.wikimedia.org/T185499) [19:38:51] (03CR) 10jerkins-bot: [V: 04-1] Set up notebook100[34] with new swap role and jupyterhub profile [puppet] - 10https://gerrit.wikimedia.org/r/417010 (https://phabricator.wikimedia.org/T183935) (owner: 10Ottomata) [19:39:44] tgr: Hi. If I'm not mistaken, earlier today you deployed https://gerrit.wikimedia.org/r/#/c/416939/ with odder. I see the new logo at li.wikipedia.org, but still the old one at pcd.wikipedia.org. [19:40:23] aharoni: what's the logo URL you see?
[19:40:29] If I look at https://pcd.wikipedia.org/static/images/project-logos/pcdwiki-2x.png , then I see an old one, but if I look at https://pcd.wikipedia.org/static/images/project-logos/pcdwiki-2x.png?action=purge , then I see the new one. [19:40:29] (03CR) 10Anomie: wiki-replicas: Accommodate new comments table with rules and compatibility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:40:46] (03CR) 10Andrew Bogott: wikitech: use files from swift rather than local uploads. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [19:41:08] hi aharoni ;-) [19:41:19] O HAI [19:42:00] (03PS8) 10Andrew Bogott: wikitech: use files from swift rather than local uploads. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) [19:42:41] (03CR) 10Krinkle: [C: 04-1] "Yikes, one more thing ;)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [19:43:03] !log ladsgroup@terbium:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T183019) [19:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:18] T183019: Wikibase must not insert local recentchanges entries for nonexistent local users (days: 5) - https://phabricator.wikimedia.org/T183019 [19:43:35] (03PS3) 10Ottomata: Set up notebook100[34] with new swap role and jupyterhub profile [puppet] - 10https://gerrit.wikimedia.org/r/417010 (https://phabricator.wikimedia.org/T183935) [19:44:18] (03PS4) 10Ottomata: Set up notebook100[34] with new swap role and jupyterhub profile [puppet] - 10https://gerrit.wikimedia.org/r/417010 (https://phabricator.wikimedia.org/T183935) [19:45:15] (03CR) 10Ottomata: 
"https://puppet-compiler.wmflabs.org/compiler02/10335/notebook1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/417010 (https://phabricator.wikimedia.org/T183935) (owner: 10Ottomata) [19:45:17] (03CR) 10Ottomata: [C: 032] Set up notebook100[34] with new swap role and jupyterhub profile [puppet] - 10https://gerrit.wikimedia.org/r/417010 (https://phabricator.wikimedia.org/T183935) (owner: 10Ottomata) [19:47:10] twkozlowski, tgr - any idea why is it not updated at https://pcd.wikipedia.org/wiki/Accueul ? [19:48:39] aharoni: I only purged the 1x files, will fix [19:49:03] tgr: thanks [19:50:26] aharoni: should be done [19:51:53] tgr: \o/ done [19:51:55] thanks [19:52:09] twkozlowski: tgr fixed it [19:52:14] thanks both [19:52:52] ok great, thanks tgr [19:54:04] aharoni: Yup, we talked about this earlier, just needed to purge the cache ;) [19:54:12] Many thanks again tgr [19:54:12] (03PS1) 10Herron: use hiera3 role/nuyaml backends only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/417012 (https://phabricator.wikimedia.org/T188623) [19:54:22] (03PS5) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650) [19:54:34] tgr: Good thing is, there's just one more Wikipedia to move to the version 2 logo, and that would be it [19:54:49] Took us just under 8 years :-P [19:55:12] twkozlowski: thanks for the logo cleanup! 
it's nice to see some "fix XXX on all wikis" task actually getting wrapped up for once :) [19:55:45] (03PS6) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650) [19:56:00] (03CR) 10Paladox: [C: 031] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/417012 (https://phabricator.wikimedia.org/T188623) (owner: 10Herron) [19:56:36] (03CR) 10Bstorm: [C: 032] wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/416496 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:58:05] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint: Give Roan Kattouw the rights to deploy maps and restart maps-related services - https://phabricator.wikimedia.org/T189153#4033165 (10Gehel) [19:59:07] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:59:08] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:59:37] me ^ [19:59:37] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 9 seconds ago with 6 failures. Failed resources (up to 3 shown): Package[oozie-client],Package[mahout],Package[hadoop-client],Package[libhdfs0] [19:59:57] (03PS2) 10Ladsgroup: Convert ORES tresholds config to new syntax (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415861 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [20:00:04] thcipriani: Time to snap out of that daydream and deploy MediaWiki train. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T2000). [20:00:04] (03PS9) 10Andrew Bogott: wikitech: use files from swift rather than local uploads. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:06] (03PS1) 10Andrew Bogott: preliminary steps for moving wikitech to swift and hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417017 (https://phabricator.wikimedia.org/T188915) [20:00:19] * thcipriani does some train [20:01:20] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4033211 (10Ottomata) [20:02:24] (03CR) 10Andrew Bogott: [C: 031] use hiera3 role/nuyaml backends only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/417012 (https://phabricator.wikimedia.org/T188623) (owner: 10Herron) [20:02:37] (03CR) 10Ladsgroup: [C: 032] Convert ORES tresholds config to new syntax (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415861 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [20:02:47] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 55 seconds ago with 2 failures. 
Failed resources (up to 3 shown): Service[apache2],File[/etc/mysql/conf.d/research-client.cnf] [20:03:52] (03CR) 10Herron: [C: 032] "Compiler looks good https://puppet-compiler.wmflabs.org/compiler02/10338/" [puppet] - 10https://gerrit.wikimedia.org/r/417012 (https://phabricator.wikimedia.org/T188623) (owner: 10Herron) [20:03:55] (03Merged) 10jenkins-bot: Convert ORES tresholds config to new syntax (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415861 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [20:04:10] (03PS2) 10Herron: use hiera3 role/nuyaml backends only in production realm [puppet] - 10https://gerrit.wikimedia.org/r/417012 (https://phabricator.wikimedia.org/T188623) [20:06:53] (03PS1) 10Ottomata: SWAP: set web_proxy and admin groups for new notebook100[34] [puppet] - 10https://gerrit.wikimedia.org/r/417019 (https://phabricator.wikimedia.org/T183935) [20:07:24] (03CR) 10jerkins-bot: [V: 04-1] SWAP: set web_proxy and admin groups for new notebook100[34] [puppet] - 10https://gerrit.wikimedia.org/r/417019 (https://phabricator.wikimedia.org/T183935) (owner: 10Ottomata) [20:08:31] (03CR) 10jenkins-bot: Convert ORES tresholds config to new syntax (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415861 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [20:09:43] jouncebot: next [20:09:43] In 0 hour(s) and 50 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T2100) [20:09:46] (03PS1) 10Odder: Update logo for Banyumasan and Urdu Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417020 (https://phabricator.wikimedia.org/T189155) [20:09:57] Derp, it's train time. Nvm! 
[20:11:02] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10339/notebook1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/417019 (https://phabricator.wikimedia.org/T183935) (owner: 10Ottomata) [20:11:58] (03PS2) 10Ottomata: SWAP: set web_proxy and admin groups for new notebook100[34] [puppet] - 10https://gerrit.wikimedia.org/r/417019 (https://phabricator.wikimedia.org/T183935) [20:12:20] (03CR) 10Amire80: [C: 031] Update logo for Banyumasan and Urdu Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417020 (https://phabricator.wikimedia.org/T189155) (owner: 10Odder) [20:12:31] (03CR) 10jerkins-bot: [V: 04-1] SWAP: set web_proxy and admin groups for new notebook100[34] [puppet] - 10https://gerrit.wikimedia.org/r/417019 (https://phabricator.wikimedia.org/T183935) (owner: 10Ottomata) [20:13:15] (03PS3) 10Ottomata: SWAP: set web_proxy and admin groups for new notebook100[34] [puppet] - 10https://gerrit.wikimedia.org/r/417019 (https://phabricator.wikimedia.org/T183935) [20:13:18] (03CR) 10Ottomata: [V: 032 C: 032] SWAP: set web_proxy and admin groups for new notebook100[34] [puppet] - 10https://gerrit.wikimedia.org/r/417019 (https://phabricator.wikimedia.org/T183935) (owner: 10Ottomata) [20:15:18] :) [20:15:20] (03PS6) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) [20:17:21] (03CR) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [20:17:39] (03PS10) 10Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) [20:17:49] (03PS1) 10Thcipriani: Group1 to php-1.31.0-wmf.24 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/417022 [20:18:46] andrewbogott: thcipriani: Y'all talk [20:18:49] :D [20:18:57] what's up? [20:19:02] thcipriani: I have some wikitech-specific config patches I want to roll out [20:19:10] just wondering if/when your toes are out of the way [20:19:28] ah, I hope to be out of the way in a few minutes :) [20:19:33] I can ping you when I'm done? [20:19:37] (03PS1) 10Bstorm: wiki replicas: Small fix in maintain-views changes for comment table [puppet] - 10https://gerrit.wikimedia.org/r/417024 (https://phabricator.wikimedia.org/T181650) [20:19:37] yep! thanks [20:19:42] k will do [20:19:49] thcipriani: I could also use supervision when I actually roll things out since I haven't done this in ages [20:20:04] ah, yeah, sure I can help with that after train no problem [20:20:56] thcipriani: you're a boss, just sayin' [20:22:47] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:23:25] * thcipriani doffs hat [20:23:47] (03CR) 10Thcipriani: [C: 032] Group1 to php-1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417022 (owner: 10Thcipriani) [20:24:37] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:24:58] (03Merged) 10jenkins-bot: Group1 to php-1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417022 (owner: 10Thcipriani) [20:27:49] !log thcipriani@tin rebuilt and synchronized wikiversions files: Group1 to php-1.31.0-wmf.24 [20:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:20] (03CR) 10jenkins-bot: Group1 to php-1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417022 (owner: 10Thcipriani) [20:31:58] (03PS3) 10Smalyshev: Add configuration for CirrusSearch to instantly index new Wikidata items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413899 (https://phabricator.wikimedia.org/T183053) 
[20:35:52] andrewbogott: okie doke looking at patches now, we can treat this like a SWAT if you'd rather I deployed stuff or I can monitor and help. [20:36:20] thcipriani: I could use the practice if you don't mind watching [20:36:23] first I'm going to do https://gerrit.wikimedia.org/r/#/c/415914/ [20:36:36] sure thing, sounds good :) [20:36:51] I'm done on tin, so it's all yours [20:37:01] ok, so first, +2 [20:37:16] (03CR) 10Andrew Bogott: [C: 032] multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:37:48] (03PS11) 10Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) [20:37:50] this is either helpful or overwhelming: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [20:37:52] :) [20:37:54] hm, need to untangle some dependencies [20:38:40] (03PS12) 10Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) [20:41:19] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/10340/terbium.eqiad.wmnet/change.terbium.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/416751 (owner: 10Dzahn) [20:41:24] (03CR) 10Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:41:27] (03CR) 10Andrew Bogott: [C: 032] multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:42:21] (03PS2) 10Andrew Bogott: preliminary steps for 
moving wikitech to swift and hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417017 (https://phabricator.wikimedia.org/T188915) [20:42:23] (03CR) 10Dzahn: [C: 04-1] "there is still the apache class applied on the maintenance servers through the mediawiki module, so apache and httpd will always conflict " [puppet] - 10https://gerrit.wikimedia.org/r/416751 (owner: 10Dzahn) [20:42:27] (03PS10) 10Andrew Bogott: wikitech: use files from swift rather than local uploads. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) [20:42:41] (03Merged) 10jenkins-bot: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:42:54] (03CR) 10Dzahn: [C: 04-2] noc: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416751 (owner: 10Dzahn) [20:42:56] (03CR) 10jenkins-bot: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:43:22] ok, now doing fetch/rebase in /srv/mediawiki-staging [20:43:36] okie doke [20:44:21] and scap sync-file multiversion/MWMultiVersion.php [20:45:07] yeah, if you want a message for sal you can add that at the end [20:45:46] !log andrew@tin Synchronized multiversion/MWMultiVersion.php: (no justification provided) (duration: 01m 16s) [20:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:51] thcipriani: so that's all the steps, right? I'm not leaving things in an inconsistent state that requires a full sync now? [20:48:04] that's everything for that patch [20:48:33] great. I'll do https://gerrit.wikimedia.org/r/#/c/417017/ now then [20:48:38] in general if you don't have l10nupdates you can just sync files one at a time with sync-file. 
[20:48:41] (03CR) 10Andrew Bogott: [C: 032] preliminary steps for moving wikitech to swift and hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417017 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [20:49:45] if you want files to arrive in a certain order, i.e. one depends on the other using sync-file is the only way to really ensure that they arrive in that order. If files in a folder are orthogonal you can do a scap sync-dir on the whole directory. [20:51:11] (03Merged) 10jenkins-bot: preliminary steps for moving wikitech to swift and hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417017 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [20:51:53] 10Operations, 10Traffic, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#4033339 (10daniel) [20:52:44] (03CR) 10jenkins-bot: preliminary steps for moving wikitech to swift and hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417017 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [20:54:23] thcipriani: can scap sync-file take two files as args, or do I need to do it one at a time? 
[20:54:50] if they share a common directory you can pass that directory [20:55:00] but sync-file only takes one path at a time [20:55:13] ok, I'll just do this one at a time, easier to understand [20:55:21] (03PS4) 10Herron: add puppetdb role to puppetdb[12]001 servers [puppet] - 10https://gerrit.wikimedia.org/r/409995 (https://phabricator.wikimedia.org/T185499) [20:56:33] !log andrew@tin Synchronized wmf-config/CommonSettings.php: Preparing wikitech to use swift for images, step one (duration: 01m 16s) [20:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:46] (03PS2) 10Madhuvishy: dumps: Refactor fetcher profile [puppet] - 10https://gerrit.wikimedia.org/r/416983 (https://phabricator.wikimedia.org/T188727) [20:58:15] (03PS2) 10Krinkle: Improve load-order documentation for CommonSettings and InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416879 [20:58:21] (03PS5) 10Herron: add puppetdb role to puppetdb[12]001 servers [puppet] - 10https://gerrit.wikimedia.org/r/409995 (https://phabricator.wikimedia.org/T185499) [20:58:34] no_justification: ^ looks like that patch didnt get merged somehow [20:58:44] (doc patch) [20:58:46] !log andrew@tin Synchronized wmf-config/filebackend.php: Preparing wikitech to use swift for images, step two (duration: 01m 12s) [20:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:10] (03PS11) 10Andrew Bogott: wikitech: use files from swift rather than local uploads. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. 
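(Note on the sync-file discussion above: a small sketch of the "one path at a time, in dependency order" approach. It only builds the command lines and does not run anything. The `scap sync-file <path> <message>` shape is taken from the log itself; the helper name build_sync_commands and the "step N" message suffix are assumptions made up for illustration.)

```python
# Sketch only: builds the command lines, never executes them.
# build_sync_commands is a hypothetical helper, not part of scap.

def build_sync_commands(files_in_order, message):
    """Return one `scap sync-file` argv list per path, preserving the given order."""
    commands = []
    for step, path in enumerate(files_in_order, start=1):
        # sync-file takes a single path, so each file gets its own invocation;
        # the trailing argument is the message that ends up in the admin log.
        commands.append(["scap", "sync-file", path, f"{message}, step {step}"])
    return commands


commands = build_sync_commands(
    ["wmf-config/CommonSettings.php", "wmf-config/filebackend.php"],
    "Preparing wikitech to use swift for images",
)

for argv in commands:
    print(" ".join(argv[:3]), repr(argv[3]))
```

Listing the file that others depend on first is the whole point: since each invocation finishes before the next starts, the order of the list is the order in which files reach the servers.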
[21:00:46] Krinkle: sometimes gerrit tells you it needs a rebase *before* you +2, sometimes it doesn't. I can ensure it gets merged here after andrewbogott is done on tin. [21:01:06] (03CR) 10Andrew Bogott: [C: 032] wikitech: use files from swift rather than local uploads. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [21:01:23] one more to go, should be done soon [21:01:28] no rush [21:01:42] and then I will test, and then no doubt have more patches :( [21:02:09] nothing for parsoid today [21:02:27] (03Merged) 10jenkins-bot: wikitech: use files from swift rather than local uploads. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [21:02:42] (03CR) 10jenkins-bot: wikitech: use files from swift rather than local uploads. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416607 (https://phabricator.wikimedia.org/T188915) (owner: 10Andrew Bogott) [21:05:17] !log andrew@tin Synchronized wmf-config/InitialiseSettings.php: Switch wikitech to swift (duration: 01m 15s) [21:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:02] (03PS2) 10Madhuvishy: dumps: Enable fetcher for labstore1006|7 [puppet] - 10https://gerrit.wikimedia.org/r/416984 (https://phabricator.wikimedia.org/T188727) [21:08:42] thcipriani, Krinkle: Discrepancy in gerrit UI is caching behavior [21:08:50] Calculating "is this mergeable" is expensive. [21:09:35] thcipriani: ok, I think I'm done for now. Thank you! [21:10:07] andrewbogott: cool, kudos on the deploy :) [21:10:40] of course since the best outcome is no change I'm having a hard time convincing myself that this is really using swift now [21:10:51] no_justification: makes sense. I wish it would let you rebase when you know better though! 
http://tyler.zone/why-gerrit.gif [21:13:26] (03PS3) 10Thcipriani: Improve load-order documentation for CommonSettings and InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416879 (owner: 10Krinkle) [21:15:02] thcipriani i think upstream may have improved this though not sure. [21:15:15] judging by the commit msg on https://gerrit-review.googlesource.com/#/c/gerrit/+/152510/ it seems so [21:16:24] and it's being used in pg here https://gerrit-review.googlesource.com/#/c/gerrit/+/153850/ [21:16:40] it's not too much of a problem really. I run into it maybe once every few weeks and it's easy to fix. minor annoyance. [21:17:03] (03CR) 10Thcipriani: [C: 032] Improve load-order documentation for CommonSettings and InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416879 (owner: 10Krinkle) [21:17:59] (03PS1) 10DCausse: Switch mjolnir kafka broker to jumbo-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/417038 (https://phabricator.wikimedia.org/T188408) [21:18:13] no_justification: I uploaded an image on wikitech and viewed it on newwikitech, so I conclude that uploading to swift is working. [21:18:15] thank you for your help! 
[21:18:33] (03Merged) 10jenkins-bot: Improve load-order documentation for CommonSettings and InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416879 (owner: 10Krinkle) [21:18:48] (03CR) 10jenkins-bot: Improve load-order documentation for CommonSettings and InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416879 (owner: 10Krinkle) [21:21:59] andrewbogott: yw [21:24:51] !log thcipriani@tin Synchronized wmf-config: [[gerrit:416879|Improve load-order documentation for CommonSettings and InitialiseSettings]] noop doc change (duration: 01m 18s) [21:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:36] (03CR) 10Anomie: [C: 04-1] wiki replicas: Small fix in maintain-views changes for comment table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/417024 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [21:38:50] (03CR) 10Bstorm: "Ah ok. I was trying to guess at this one because this doesn't seem to appear in the current schema." 
[puppet] - 10https://gerrit.wikimedia.org/r/417024 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [21:39:45] (03PS2) 10Bstorm: wiki replicas: Small fix in maintain-views changes for comment table [puppet] - 10https://gerrit.wikimedia.org/r/417024 (https://phabricator.wikimedia.org/T181650) [21:40:25] (03PS1) 10Herron: puppetdb: add puppetdb4 apt component when puppetdb_major_version >= 4 [puppet] - 10https://gerrit.wikimedia.org/r/417054 (https://phabricator.wikimedia.org/T185502) [21:45:10] !log temporarily disabling puppet agents while puppetdb servers nitrogen and nihal are rebooted for kernel updates [21:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:40] (03PS1) 10Smalyshev: Add consumer ID to Updater launch string [puppet] - 10https://gerrit.wikimedia.org/r/417115 (https://phabricator.wikimedia.org/T188716) [21:50:41] (03PS2) 10Rush: openstack: initial nova setup for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/416608 (https://phabricator.wikimedia.org/T188266) [21:51:35] !log puppetdb server reboots complete — re-enabling puppet agents [21:51:36] (03PS1) 10Andrew Bogott: wikitech.wikimedia.org: change ttl to 5M [dns] - 10https://gerrit.wikimedia.org/r/417163 [21:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:11] (03PS3) 10Rush: openstack: initial nova setup for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/416608 (https://phabricator.wikimedia.org/T188266) [21:58:13] (03PS2) 10Herron: puppetdb: add puppetdb4 apt component when puppetdb_major_version == 4 [puppet] - 10https://gerrit.wikimedia.org/r/417054 (https://phabricator.wikimedia.org/T185502) [22:00:05] ejegg: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for security patch. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T2200). [22:00:05] No GERRIT patches in the queue for this window AFAICS. 
[22:00:42] (03CR) 10Rush: [C: 032] openstack: initial nova setup for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/416608 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [22:06:26] (03CR) 10Andrew Bogott: [C: 032] wikitech.wikimedia.org: change ttl to 5M [dns] - 10https://gerrit.wikimedia.org/r/417163 (owner: 10Andrew Bogott) [22:18:36] hi ops, I just saw the error rate check fail on 1/11 canaries, but it seems totally unrelated to the thing I just scapped: https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 [22:18:42] any ideas? [22:19:28] thcipriani: ^ [22:19:46] or twentyafterfour ^ [22:20:52] * thcipriani looks [22:21:52] ejegg: if it was 1/11 deploy didn't fail correct? The error in the logs is an ongoing one so is likely unrelated to your deployment. [22:22:18] correct, with 1/11 it let the deploy continue [22:22:44] so if that's a known thing, I'll deploy the patch to the wmf.23 wikis [22:23:39] !log deployed patch for T171987 to 1.31.0-wmf.24 [22:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:49] !log deployed patch for T171987 to 1.31.0-wmf.23 [22:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:04] MaxSem: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for decurity deployment . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180307T2230). [22:30:04] No GERRIT patches in the queue for this window AFAICS. 
[22:30:10] woo [22:44:37] !log dumping centralauth.spoofuser from db1079 [22:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:39] (03PS2) 10BryanDavis: openstack: Refactor dns-floating-ip-updater.py script [puppet] - 10https://gerrit.wikimedia.org/r/416991 [22:51:41] (03PS1) 10BryanDavis: openstack: Refactor /root/novaenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/417169 [22:51:43] (03PS1) 10BryanDavis: openstack: Promote DnsManager to mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/417170 [22:52:20] !log maxsem@tin Synchronized php-1.31.0-wmf.24/extensions/CentralAuth/: https://gerrit.wikimedia.org/r/#/c/417014/ (duration: 01m 20s) [22:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:50] (03CR) 10BryanDavis: "> I think this would be easier to review/test if you break out the" [puppet] - 10https://gerrit.wikimedia.org/r/416991 (owner: 10BryanDavis) [22:54:24] 10Operations, 10Traffic, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): How to purge misc-web varnishes for wikitech changes? - https://phabricator.wikimedia.org/T189168#4033759 (10Andrew) @bblack, I suspect you're the one who would know an answer for this [22:54:28] 10Operations, 10Traffic, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): How to purge misc-web varnishes for wikitech changes? 
- https://phabricator.wikimedia.org/T189168#4033761 (10bd808) [22:59:47] (03PS3) 10Madhuvishy: nfs-mount-manager: Add option to kill process accessing a mount [puppet] - 10https://gerrit.wikimedia.org/r/408864 (https://phabricator.wikimedia.org/T171540) [23:00:17] !log maxsem@tin Synchronized php-1.31.0-wmf.24/extensions/AntiSpoof/: https://gerrit.wikimedia.org/r/#/c/417013/ (duration: 01m 16s) [23:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:58] !log running script for T187516 [23:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:02] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.93 seconds [23:36:45] !log aborted due to growing DB lag [23:36:45] 10Operations, 10Traffic, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): How to purge misc-web varnishes for wikitech changes? - https://phabricator.wikimedia.org/T189168#4033862 (10Krenair) hieradata/role/common/cache/misc.yaml:profile::cache::base::purge_multicasts: ['239.128.0.115'] hieradata/... [23:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:27] 1002 [23:50:11] (03PS1) 10BryanDavis: wikitech: use FQDNs for m5 cluster members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417173 [23:54:54] jouncebot: next [23:54:54] In 0 hour(s) and 5 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180308T0000) [23:56:23] jouncebot: refresh [23:56:25] I refreshed my knowledge about deployments. [23:57:12] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 118.48 seconds