[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T0000). [00:03:25] James_F: we're waiting 3 Jenkins jobs [00:04:00] Dereckson: Isn't it fun? [00:04:33] for a setting in JS, probably [00:05:34] (03PS1) 10RobH: new labstore partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/319500 [00:05:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:05:51] RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [00:06:30] yurik: still alive? [00:09:33] James_F: live on mw1099 [00:10:11] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [00:10:14] Thanks. [00:12:31] (03PS1) 10Yuvipanda: jupyterhub: Don't set HTTP_PROXY on jupyterhub itself [puppet] - 10https://gerrit.wikimedia.org/r/319501 (https://phabricator.wikimedia.org/T149543) [00:12:55] (03CR) 10Reedy: [C: 04-1] "I think we're going to have to have both SF and PF in one branch for the transition. So we can't just blindly swap one for the other until" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319131 (https://phabricator.wikimedia.org/T149749) (owner: 10Paladox) [00:13:35] (03CR) 10Paladox: "@Reedy that has already been done." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319131 (https://phabricator.wikimedia.org/T149749) (owner: 10Paladox) [00:14:02] (03PS2) 10Yuvipanda: jupyterhub: Don't set HTTP_PROXY on jupyterhub itself [puppet] - 10https://gerrit.wikimedia.org/r/319501 (https://phabricator.wikimedia.org/T149543) [00:14:09] (03CR) 10Yuvipanda: [C: 032 V: 032] jupyterhub: Don't set HTTP_PROXY on jupyterhub itself [puppet] - 10https://gerrit.wikimedia.org/r/319501 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [00:14:55] (03CR) 10Reedy: "Have all the messages been re-keyed?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319131 (https://phabricator.wikimedia.org/T149749) (owner: 10Paladox) [00:15:11] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 210 seconds ago with 0 failures [00:15:13] Dereckson, yep [00:15:21] (03CR) 10Paladox: "Yes, it has been added to translatewiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319131 (https://phabricator.wikimedia.org/T149749) (owner: 10Paladox) [00:15:38] (03CR) 10Paladox: "@Reedy see https://phabricator.wikimedia.org/T147582" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319131 (https://phabricator.wikimedia.org/T149749) (owner: 10Paladox) [00:15:46] yurik: patch live on mw1099 [00:15:55] Gah, slow network is slow. [00:16:13] (03CR) 10Paladox: "https://gerrit.wikimedia.org/r/#/c/317516/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319131 (https://phabricator.wikimedia.org/T149749) (owner: 10Paladox) [00:16:56] (03CR) 10Paladox: "https://phabricator.wikimedia.org/diffusion/EPFM/repository/master/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319131 (https://phabricator.wikimedia.org/T149749) (owner: 10Paladox) [00:17:51] Dereckson, seems fine [00:17:56] Dereckson: Hmm, are you sure it's live? [00:18:10] yurik Dereckson yep, I confirm too, seems fine [00:18:12] yurik: ack'ed [00:18:15] ok [00:18:21] James_F: double checking that [00:18:43] * James_F refreshes. [00:20:36] James_F: yes sure, mw1099 and tin MD5 match, content doesn't have the removed lines [00:21:06] Kk. [00:21:06] Re-reloading. 
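For context, a minimal sketch of the tin-vs-mw1099 MD5 cross-check Dereckson describes above. This is a hypothetical helper, not actual WMF tooling (the check was done by hand); the file path is an example from the synced directory, and key-based ssh access from the deploy host is assumed.

```python
#!/usr/bin/env python3
"""Compare a staged file on the deploy host with the copy on a canary."""
import hashlib
import subprocess

# Example path only: one file under the synced Kartographer/styles/ tree.
PATH = '/srv/mediawiki/php-1.29.0-wmf.1/extensions/Kartographer/styles/kartographer.less'

def local_md5(path):
    # Digest of the staged copy on the deploy host (tin).
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def remote_md5(host, path):
    # Digest of the deployed copy, via ssh + md5sum on the target.
    out = subprocess.check_output(['ssh', host, 'md5sum', path])
    return out.split()[0].decode()

if local_md5(PATH) == remote_md5('mw1099.eqiad.wmnet', PATH):
    print('mw1099 matches tin')
else:
    print('mismatch: content not yet synced')
```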
[00:21:59] bblack: I filed https://phabricator.wikimedia.org/T149865 earlier, feel free to update :) [00:23:37] Dereckson: Yay, LGTM. [00:23:57] !log dereckson@tin Synchronized php-1.29.0-wmf.1/extensions/Kartographer/styles/: Set font size to 14px for both static and interactive maps (T149860) (duration: 00m 47s) [00:24:02] James_F: ok, syncing [00:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:06] T149860: Snapshot container inherits font size from parent - https://phabricator.wikimedia.org/T149860 [00:25:04] Dereckson: Thank you. [00:26:47] (03PS1) 10Yuvipanda: jupyterhub: Do not use proxying when talking to localhost [puppet] - 10https://gerrit.wikimedia.org/r/319503 (https://phabricator.wikimedia.org/T149543) [00:27:04] (03PS2) 10Yuvipanda: jupyterhub: Do not use proxying when talking to localhost [puppet] - 10https://gerrit.wikimedia.org/r/319503 (https://phabricator.wikimedia.org/T149543) [00:27:13] (03CR) 10Yuvipanda: [C: 032 V: 032] jupyterhub: Do not use proxying when talking to localhost [puppet] - 10https://gerrit.wikimedia.org/r/319503 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [00:28:48] !log dereckson@tin Synchronized php-1.29.0-wmf.1/resources/src/mediawiki.widgets/mw.widgets.TitleWidget.js: Follow-up Id0021594: Remove extra code for redlink suggestions (T149130) (duration: 00m 46s) [00:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:55] T149130: [Regression wmf.23] MW's title suggestion widget displays a suggestion for "foo" and "Foo" for non-existing pages on input of "foo" (OK), and for "Foo" and "Foo" on input of "Foo" (not OK) - https://phabricator.wikimedia.org/T149130 [00:29:03] Yay. [00:29:26] 06Operations, 10Traffic: 503 errors for users connecting to esams - https://phabricator.wikimedia.org/T149865#2767016 (10fgiunchedi) Update: it was due to a suspected problem with eqiad<->esams wave link, @bblack has failed over to the MPLS eqiad<->knams link and things seem to have stabilized for now. [00:31:05] (03PS1) 10Yuvipanda: jupyterhub: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/319504 (https://phabricator.wikimedia.org/T149543) [00:31:43] (03PS2) 10Yuvipanda: jupyterhub: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/319504 (https://phabricator.wikimedia.org/T149543) [00:31:47] (03CR) 10Yuvipanda: [C: 032 V: 032] jupyterhub: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/319504 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [00:33:13] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2767038 (10Cmjohnson) [00:38:48] 06Operations, 10hardware-requests: eqiad/codfw: swift frontend hardware refresh - https://phabricator.wikimedia.org/T148510#2767056 (10RobH) 05Open>03stalled a:03RobH I've requested quotes on the sub-task in the #procurement S4 space. I'm setting this as assigned to me, and stalled, until quotes are bac... 
[00:40:06] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:06] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 190 seconds ago with 0 failures [01:16:21] (03CR) 10Dzahn: [C: 032] "[bast2001:~] $ host 10.193.1.7" [dns] - 10https://gerrit.wikimedia.org/r/319490 (owner: 10Dzahn) [01:16:37] (03PS2) 10Dzahn: add forward DNS for papaul-laptop.mgmt [dns] - 10https://gerrit.wikimedia.org/r/319490 [01:18:18] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:20:49] (03CR) 10Dzahn: [C: 032] "[radon:~] $ host 10.64.16.16" [dns] - 10https://gerrit.wikimedia.org/r/319495 (https://phabricator.wikimedia.org/T135253) (owner: 10Dzahn) [01:21:03] (03PS3) 10Dzahn: remove db1027 remnants (reverse lookup) [dns] - 10https://gerrit.wikimedia.org/r/319495 (https://phabricator.wikimedia.org/T135253) [01:23:55] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/288943/" [dns] - 10https://gerrit.wikimedia.org/r/319495 (https://phabricator.wikimedia.org/T135253) (owner: 10Dzahn) [01:25:57] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/319495" [dns] - 10https://gerrit.wikimedia.org/r/289168 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [01:25:59] (03PS1) 10Yuvipanda: jupyterhub: widen group of users who can log in [puppet] - 10https://gerrit.wikimedia.org/r/319506 (https://phabricator.wikimedia.org/T149543) [01:26:52] (03PS3) 10Dzahn: phabricator: Fix empty "parentProject" when new project is a milestone [puppet] - 10https://gerrit.wikimedia.org/r/318699 (owner: 10Aklapper) [01:26:54] (03CR) 10jenkins-bot: [V: 04-1] jupyterhub: widen group of users who can log in [puppet] - 10https://gerrit.wikimedia.org/r/319506 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [01:32:01] (03PS2) 10Yuvipanda: jupyterhub: widen group of users who can log in [puppet] - 10https://gerrit.wikimedia.org/r/319506 (https://phabricator.wikimedia.org/T149543) [01:34:42] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1802.791511 Seconds [01:34:42] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1802.797276 Seconds [01:35:42] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 30.301846 Seconds [01:35:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 30.306532 Seconds [01:36:05] (03CR) 10Yuvipanda: [C: 032 V: 032] jupyterhub: widen group of users who can log in [puppet] - 10https://gerrit.wikimedia.org/r/319506 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [01:42:22] (03CR) 10Dzahn: [V: 032] phabricator: Fix empty "parentProject" when new project is a milestone [puppet] - 10https://gerrit.wikimedia.org/r/318699 (owner: 10Aklapper) [01:42:27] (03PS4) 10Dzahn: phabricator: Fix empty "parentProject" when new project is a milestone [puppet] - 10https://gerrit.wikimedia.org/r/318699 (owner: 10Aklapper) [01:42:35] (03CR) 10Dzahn: [V: 032] phabricator: Fix empty "parentProject" when new project is a milestone [puppet] - 10https://gerrit.wikimedia.org/r/318699 (owner: 10Aklapper) [01:46:22] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [01:46:39] (03CR) 10Dzahn: [C: 032] "https://github.com/mwclient/mwclient/commit/61155f1b7ed118c970a5749f87b8f6dbbc423aaa" [debs/adminbot] - 10https://gerrit.wikimedia.org/r/319137 
(https://phabricator.wikimedia.org/T124852) (owner: 10MtDu) [01:46:57] (03PS1) 10Yuvipanda: jupyterhub: Add additional protections against arbitrary user login [puppet] - 10https://gerrit.wikimedia.org/r/319507 (https://phabricator.wikimedia.org/T149543) [01:47:21] (03CR) 10Dzahn: [V: 032] Use page.text() instead of deprecated page.edit() [debs/adminbot] - 10https://gerrit.wikimedia.org/r/319137 (https://phabricator.wikimedia.org/T124852) (owner: 10MtDu) [01:47:31] (03PS2) 10Yuvipanda: jupyterhub: Add additional protections against arbitrary user login [puppet] - 10https://gerrit.wikimedia.org/r/319507 (https://phabricator.wikimedia.org/T149543) [01:50:58] (03CR) 10Yuvipanda: [C: 032 V: 032] jupyterhub: Add additional protections against arbitrary user login [puppet] - 10https://gerrit.wikimedia.org/r/319507 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [01:54:05] (03PS1) 10Yuvipanda: jupyterhub: Call parent coroutine properly [puppet] - 10https://gerrit.wikimedia.org/r/319508 [01:54:21] (03CR) 10Yuvipanda: [C: 032 V: 032] jupyterhub: Call parent coroutine properly [puppet] - 10https://gerrit.wikimedia.org/r/319508 (owner: 10Yuvipanda) [01:55:42] jenkins (zuul) fixed by https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Gearman_deadlock [01:56:07] yuvipanda: it should start voting again (i think) [01:56:32] ah nice [01:56:33] thanks mutante [01:56:40] I'm mostly done for the day I think tho [02:09:47] mutante , any reason that (and moth other things) is not moved to wikitech? [02:09:54] most* [02:13:51] arseny92: no [02:14:16] (i mean, i dont see any, but i didnt make it) [02:14:17] grrrit-wm: restart [02:14:18] re-connecting to gerrit [02:14:19] reconnected to gerrit [02:14:26] mutante ^^ works [02:14:54] all that shows is it outputted two messages [02:15:03] Reedy unerneeth i have the logs [02:15:39] the real test will be when the next gerrit restart happens [02:15:41] i'll try then [02:15:42] paladox => #wikimedia-operations grrrit-wm: restart [02:15:42] info: Connecting to gerrit.. [02:15:42] Client disconnected [02:15:42] info: Connecting to gerrit.. [02:15:42] re-connecting to gerrit [02:15:43] info: Connected; requesting stream-events [02:15:43] reconnected to gerrit [02:15:44] info: Connected to event stream! [02:15:48] mutante ^^ Reedy ^^ [02:15:50] lol [02:15:57] yeh [02:16:21] mutante i can try that now with gerrit, i am using the test gerrit server [02:16:28] gerrit.git.wmflabs.org [02:17:00] paladox: a) it's nice that it doesnt have to kill the IRC connection but b) please dont keep using the prod channel, use the test channel for that [02:17:35] mutante i have, i just want to make sure it works with the prod bot [02:17:47] sometimes my changes work with the test bot but wont work with the prod bot [02:19:04] paladox: ok, i know you cant just easily create a test-kubernetes, i get that [02:19:14] just not more than needed [02:19:25] ok [02:19:41] sometimes stuff works on dev/ppe but not on prod, so prod testing needed [02:19:42] mutante it works, restarting the gerrit test instance [02:19:52] gets the bot to automatically reconnect [02:19:59] info: Connecting to gerrit.. [02:19:59] Client error: Error: connect ECONNREFUSED 10.68.23.148:29418 [02:19:59] Client disconnected [02:19:59] info: Connecting to gerrit.. [02:20:00] Client error: Error: connect ECONNREFUSED 10.68.23.148:29418 [02:20:02] Client disconnected [02:20:04] info: Connecting to gerrit.. [02:20:06] info: Connected; requesting stream-events [02:20:08] info: Connected to event stream! 
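The pasted grrrit-wm output shows the fix in action: the bot keeps its IRC session up and simply retries the Gerrit stream-events connection until the server is back. A minimal sketch of that retry pattern follows, in Python for illustration only (the real bot is Node.js); the host and port are the test instance from the paste.

```python
#!/usr/bin/env python3
"""Retry-until-connected loop, mirroring the log output pasted above."""
import socket
import time

GERRIT = ('10.68.23.148', 29418)  # test instance address from the paste

def connect_with_retry(addr, delay=2):
    while True:
        try:
            print('info: Connecting to gerrit..')
            sock = socket.create_connection(addr, timeout=10)
            print('info: Connected; requesting stream-events')
            return sock
        except OSError as err:
            # e.g. ECONNREFUSED while gerrit is restarting; wait and retry
            print('Client error:', err)
            time.sleep(delay)

stream = connect_with_retry(GERRIT)
```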
[02:20:12] paladox: test channel please [02:20:22] such as the time when wmf did a public test of datacenter change for example [02:20:33] ok [02:20:41] mutante i did [02:22:03] yes, and i joined. you can paste there [02:22:11] ok [02:25:19] (03PS1) 10Dzahn: changepw: continue if login fails on a host [puppet] - 10https://gerrit.wikimedia.org/r/319510 [02:30:25] (03PS3) 10Dzahn: tcpircbot: improve firewall rule setup [puppet] - 10https://gerrit.wikimedia.org/r/316497 [02:30:57] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.23) (duration: 12m 08s) [02:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:50] yurik , phabtask that describes the maps deployment as static images? Have to tag/ref your announcement on hewiki VP appropriately [02:33:05] arseny92, #wikimedia-interactive [02:34:31] (03CR) 10Dzahn: "http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_09_05.html section 9.5.2" [puppet] - 10https://gerrit.wikimedia.org/r/319510 (owner: 10Dzahn) [02:35:12] (03CR) 10Papaul: [C: 04-1] "We don't need continue on line 41" [puppet] - 10https://gerrit.wikimedia.org/r/319510 (owner: 10Dzahn) [02:36:02] (03Abandoned) 10Dzahn: changepw: continue if login fails on a host [puppet] - 10https://gerrit.wikimedia.org/r/319510 (owner: 10Dzahn) [02:46:57] PROBLEM - Varnish HTTP text-backend - port 3128 on cp2019 is CRITICAL: connect to address 10.192.48.23 and port 3128: Connection refused [02:49:57] RECOVERY - Varnish HTTP text-backend - port 3128 on cp2019 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.000 second response time [02:53:28] 06Operations, 10Wikimedia-Stream, 07HTTPS, 13Patch-For-Review: stream.wikimedia.org speaks http (not https) on port 443 - https://phabricator.wikimedia.org/T102313#2767183 (10jeremyb) [02:53:31] 06Operations, 06Performance-Team, 10Traffic, 10Wikimedia-Stream, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2767182 (10jeremyb) [02:55:08] (03PS2) 10Madhuvishy: new labstore partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/319500 (https://phabricator.wikimedia.org/T149870) (owner: 10RobH) [02:59:29] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 11m 28s) [02:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:05] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov 3 03:05:05 UTC 2016 (duration 5m 36s) [03:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:46] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:15:26] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:20:26] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:22:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 779.43 seconds [03:25:11] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:25:29] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2767234 (10Andrew) Apologies if I'm repeating previous comments... 
This issue is produced in two stages: 1) designa... [03:30:11] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:32:21] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 273.70 seconds [03:34:02] i keep trying to find rulez about this - is it ok to change *-labs config files and sync-file them (not sync-dir) in production in off hours? [03:35:21] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:40:21] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:40:22] yurik: no [03:41:33] primarily because with any code change hhvm could barf and every sync touches the settings file to ensure that config cache is invalidated [03:42:51] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [03:44:54] !log wikitech-static package updates [03:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:45:11] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:45:25] bd808, huh, I thought we decided the *-labs files were okay [03:45:37] though I had forgotten about the automatic touch of that file [03:46:34] Krenair, my point exactly - no rulez, wild wild west (WWW) [03:46:37] :) [03:46:46] are we ok to deploy now? since both of you are around? [03:46:56] the rule is "don't do something dumb" [03:47:04] neither of us are roots [03:47:23] we do have the ability to restart hhvm [03:47:23] true that... sigh [03:47:30] but that's about it [03:47:45] ok, point taken, guess will wait till tomoroww [03:48:25] bd808, we should reconsider requiring that automatic touch though [03:49:15] given it evidently gives theoretically safe deployments the potential to make things break [03:49:31] Krenair: well I put it in for a really good reason. namely that there were a large number of "oh shit, better touch config and resync" errors that used to happen [03:49:51] bd808, sure it should be on by default [03:49:52] but a --no-touch flag would be ok I guess [03:49:57] yeah [03:50:09] that's what I had in mind [03:50:21] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:50:44] bd808, should I file a task? [03:51:07] I still don't like the idea of random deploys for non-critical issues. WP:DEADLINE [03:51:29] Krenair: sure. it should be fairly easy to add in [03:53:47] the potential is there even for non-random deployments [03:55:30] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [03:56:31] bd808, it was the InitialiseSettings.php file that gets touched, right? 
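A minimal sketch of the --no-touch flag floated above, based on the behaviour bd808 describes (every sync touches the settings file so the config cache is invalidated). Hypothetical code, not the actual scap implementation; the settings path and the rsync placeholder are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: make the config-cache touch in `scap pull` optional."""
import argparse
import os

SETTINGS = '/srv/mediawiki/wmf-config/InitialiseSettings.php'  # assumed path

def rsync_from_master():
    pass  # stand-in for the real step that rsyncs /srv/mediawiki

def pull(no_touch=False):
    rsync_from_master()
    if not no_touch:
        # Bump the mtime so MediaWiki rebuilds its cached configuration;
        # skipping this is exactly the risky part the discussion is about.
        os.utime(SETTINGS, None)

parser = argparse.ArgumentParser()
parser.add_argument('--no-touch', action='store_true',
                    help='skip the config-cache invalidation touch')
pull(no_touch=parser.parse_args().no_touch)
```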
[03:56:51] * bd808 double checks [03:59:01] Krenair: yes -- https://github.com/wikimedia/scap/blob/master/scap/tasks.py#L384-L390 [03:59:25] happens at the end of `scap pull` on the MW server end [04:00:10] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:00:40] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful [04:01:05] https://phabricator.wikimedia.org/T149872 [04:02:36] thanks [04:03:20] yw [04:05:10] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:09:32] bd808, Krenair - just the other day i ran into this major problem with the touch - i had to manually touch and resync - https://phabricator.wikimedia.org/T149618 [04:10:00] basically it somehow rolled back the previous deployment [04:10:04] (in cache) [04:10:14] yurik: sunc-dir for wmf-config is a horrible idea [04:10:30] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:10:33] bd808, oh?? i thought it was the recommended way [04:10:37] no [04:10:41] if not, we should disable it [04:11:11] you want to make a blacklist of paths for sync-dir? [04:11:32] sure [04:11:36] It's not really just wmf-config [04:11:40] if its not obvious :) [04:11:52] sync-dir in general is something to be careful with [04:11:55] the files get updated on the live wikis based on inode order by rsync. you have no idea which order that really will be [04:12:57] well, most deployments i have done, e.g. for an extension, i used sync-dir [04:12:59] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2767310 (10AndyRussG) [04:13:28] and I'm pretty sure that it is documented best practice to sync InitialiseSettings.php with new vars first, wait for everything to settle and then sync CommonSettings.php [04:13:35] considering that i have done tons of depls, for all sorts of stuff for many years, and only now i hear that sync-dir is evil... we should have auto-warning of sorts :) [04:13:54] it makes perfect sense btw :) [04:14:21] but one would have to think about it instead of simply rely on "this magical tool has been tried and tested and works well" [04:14:37] in theory, the sync-dir would do those steps ;) [04:14:38] there are no magic tools yurik [04:14:51] you haven't seen my toolbox ;) [04:15:06] I've seen the things your toolbox creates on wiki [04:15:12] they are all magical... except for the debugger [04:15:20] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:15:22] (03PS3) 10Madhuvishy: new labstore partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/319500 (https://phabricator.wikimedia.org/T149870) (owner: 10RobH) [04:15:46] * yurik objects! 
:-P [04:15:53] unless those were good things of course [04:17:10] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.093 second response time [04:17:43] (03PS4) 10Madhuvishy: new labstore partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/319500 (https://phabricator.wikimedia.org/T149870) (owner: 10RobH) [04:20:20] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:22:07] bd808, hey, doesn't l10nupdate use scap? [04:24:07] yeah. it uses the `scap sync-l10n` command [04:24:33] which doesn't do the touch? [04:24:43] checking on that now :) [04:25:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:25:38] (03CR) 10Madhuvishy: [C: 032 V: 032] new labstore partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/319500 (https://phabricator.wikimedia.org/T149870) (owner: 10RobH) [04:25:52] because this thing runs automatically, unsupervised, every night [04:26:02] yeah. and it does the touch [04:26:49] :| [04:27:08] that sounds like something we should change [04:27:12] so either a) I'm being a jerk by saying that's why middle of the night syncs are bad or b) we've been lucky [04:27:43] I've been on the deployer end of the HHVM crash-on-harmless-deploy bug before [04:27:57] When I was a relatively new deployer [04:27:58] I'm going with B [04:28:21] let's update your feature request :) [04:28:52] I think users got 5xx errors when it happened to me [04:29:23] I think job runners crashed then one time I was hit by it [04:29:32] 503? Krenair [04:30:03] Zppix|mobile, hmm... possibly that one. might have been 502 [04:30:16] bd808, pff, job runners :) [04:30:32] 502 i thought was client not serv [04:30:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:30:56] Zppix|mobile, what? [04:31:08] Client side [04:31:22] yeah I got that [04:31:35] I still don't understand your point [04:33:19] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2767335 (10aaron) The first approach might work using Varnish xkey support. I'm not how far along we ar... 
[04:35:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:40:22] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:45:22] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:50:12] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:55:07] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.095 second response time [04:55:17] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [04:56:44] (03PS1) 10Papaul: add missing asset tag forward DNS Bug:P4356 [dns] - 10https://gerrit.wikimedia.org/r/319512 [04:58:13] Puppet doesn't attempt to ensure salt-minion is running? [04:58:46] It did earlier today i thought [04:59:04] Zppix|mobile, what [04:59:15] 11:58 PM Puppet doesn't attempt to ensure salt-minion is running? [04:59:28] Yes [05:00:17] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:03:54] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic: CentralNotice: Review and update Varnish caching for Special:BannerLoader - https://phabricator.wikimedia.org/T149873#2767345 (10aaron) Another idea is to add a cache-busting parameter to the URLs handed out, like the cac... [05:05:17] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:10:27] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:15:37] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:20:27] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:25:29] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:30:29] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:35:19] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:35:19] PROBLEM - puppet last run on lvs1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:40:09] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:45:09] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:45:43] gj icinga [05:45:55] or frack [05:50:29] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:55:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [05:56:16] 06Operations: mgmt hosts that exist but don't resolve to an IP - https://phabricator.wikimedia.org/T149875#2767378 (10Peachey88) [05:57:30] (03PS2) 10Papaul: add missing asset tag forward DNS [dns] - 10https://gerrit.wikimedia.org/r/319512 (https://phabricator.wikimedia.org/T149875) [05:57:42] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [05:58:42] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [05:58:57] (03CR) 10Peachey88: "(For future noting) Bug: lines need a task number to work rather than a paste, I've created a corresponding task ticket for this." [dns] - 10https://gerrit.wikimedia.org/r/319512 (https://phabricator.wikimedia.org/T149875) (owner: 10Papaul) [06:00:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:03:05] PROBLEM - MariaDB Slave Lag: s1 on db1073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 544.95 seconds [06:04:22] RECOVERY - puppet last run on lvs1008 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:05:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:10:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:14:41] db1073 crashed [06:15:12] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:16:12] (03PS2) 10Jcrespo: mariadb: Depool db1052 and db1073 once extra en api load has gone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319297 [06:16:14] jynus: "just" mysql looks like heh [06:16:30] well [06:17:26] any idea why? 
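A sketch of the obvious first pass at "any idea why?": scan the MariaDB error log for the failure lines. The log path is an assumption; jynus quotes the matching [ERROR] line just below.

```python
#!/usr/bin/env python3
"""Pull error/corruption lines out of a MariaDB error log."""
import re

ERRLOG = '/a/sqldata/db1073.err'  # assumed error-log location

pattern = re.compile(r'\[ERROR\]|corrupt', re.IGNORECASE)
with open(ERRLOG) as log:
    for line in log:
        if pattern.search(line):
            print(line.rstrip())
```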
[06:17:27] being a dedicated machine is not really a difference [06:17:30] yes [06:18:06] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1052 and db1073 once extra en api load has gone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319297 (owner: 10Jcrespo) [06:18:25] 161103 5:53:53 [ERROR] Got error 180 when reading table './enwiki/text' [06:19:20] (03PS2) 10Jcrespo: mariadb: Remove references to db1042 for decommission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319298 [06:20:12] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:20:32] (03CR) 10Jcrespo: [C: 032] mariadb: Remove references to db1042 for decommission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319298 (owner: 10Jcrespo) [06:22:05] what is error 180? [06:22:44] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1052; Depool db1073; Remove references to db1042 (duration: 00m 47s) [06:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:06] perror 180 -> MySQL error code 180: Index corrupted [06:23:32] oh joy [06:23:55] it has compression and a disk failed recently [06:24:08] ah the disk then [06:24:43] going afk, assuming not much to do immediately (?) [06:25:23] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:25:35] no, I have depooled the servers, and it was depooled automatically anyway [06:25:37] go, there's other folks here. have an evening [06:26:28] only 150 errors for less than a minute, and I assume no user impact at all [06:26:46] ok! ttyl [06:28:09] !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove references to db1042 (duration: 00m 46s) [06:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:33] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:32:30] well, technically, it did not crash - an internal checksum failed [06:32:41] and it kills itself to avoid inconsistencies [06:33:01] good news: it is only on an index [06:34:37] would you rebuild or just replicate from scratch (in case there are other inconsistencies)? [06:35:01] we cannot replicate from scratch [06:35:33] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:35:43] replication started 10 years ago (unless I am misunderstanding you) [06:35:56] sorry, I mean [06:36:44] take a copy from another slave [06:36:50] I don't know the right term for that maybe [06:37:00] I use clone [06:37:09] ok. clone then... [06:37:11] but what is the other option? [06:37:25] well to rebuild only the index?
I don't know if that's an option [06:37:38] hence I ask these silly questions [06:37:45] ok, that is the part I didn't understand [06:38:00] no, it is not silly [06:38:19] (03PS3) 10Ema: SystemTap Puppet module and role::systemtap::devserver [puppet] - 10https://gerrit.wikimedia.org/r/319083 [06:38:27] I was just missing the "would you rebuld [the index] or just replicate from scratch" [06:38:28] (03CR) 10Ema: [C: 032 V: 032] SystemTap Puppet module and role::systemtap::devserver [puppet] - 10https://gerrit.wikimedia.org/r/319083 (owner: 10Ema) [06:39:07] ah [06:39:15] now that it is depooled, I am going to start replication, and see if I can repeat the problem, for debugging purposes [06:39:31] I will most likely reimage it [06:39:35] !log attempting manual re-image of labstore2004 [06:39:37] ok [06:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:33] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:40:41] 06Operations, 10Traffic: 503 errors for users connecting to esams - https://phabricator.wikimedia.org/T149865#2767401 (10ema) p:05Triage>03High [06:41:02] (03PS2) 10Ema: cache_text: route around codfw in cache::route_table [puppet] - 10https://gerrit.wikimedia.org/r/319345 (https://phabricator.wikimedia.org/T131503) [06:41:11] (03CR) 10Ema: [C: 032 V: 032] cache_text: route around codfw in cache::route_table [puppet] - 10https://gerrit.wikimedia.org/r/319345 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [06:42:56] I am checking first the error log to see how many pages got corrupted or how many times this happened [06:43:55] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2767405 (10KartikMistry) ``` npm test ``` for cxserver is failing for me. Debugging further. [06:44:51] * apergos waits for the bad news [06:45:33] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:50:13] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:51:48] https://phabricator.wikimedia.org/T149876 [06:53:16] (03PS1) 10Madhuvishy: labstore: Add mountpoint at srv for labstore-lvm-noraid partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/319518 (https://phabricator.wikimedia.org/T149870) [06:53:51] jynus: what is the internal ip of that db? [06:54:09] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2767421 (10jcrespo) a:05jcrespo>03None My part is done, process can continue. This will not reach labs. [06:54:42] (03CR) 10Madhuvishy: [C: 032 V: 032] labstore: Add mountpoint at srv for labstore-lvm-noraid partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/319518 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [06:55:19] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [06:59:29] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:00:29] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:04:44] (03PS1) 10Giuseppe Lavagetto: puppetmaster::web_frontend: use the correct path for the CRL [puppet] - 10https://gerrit.wikimedia.org/r/319523 [07:05:29] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:10:29] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:12:44] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster::web_frontend: use the correct path for the CRL [puppet] - 10https://gerrit.wikimedia.org/r/319523 (owner: 10Giuseppe Lavagetto) [07:15:19] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:18:54] !log stopping and debugging db1073 [07:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:09] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:21:53] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:25:14] (03PS2) 10Ema: cache_text: route codfw straight to applayer [puppet] - 10https://gerrit.wikimedia.org/r/319347 (https://phabricator.wikimedia.org/T131503) [07:25:22] (03CR) 10Ema: [C: 032 V: 032] cache_text: route codfw straight to applayer [puppet] - 10https://gerrit.wikimedia.org/r/319347 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [07:25:23] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:28:23] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:30:13] <_joe_> !log restarting pybal on lvs2005 [07:30:13] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:33] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:38:23] <_joe_> !log rolling restart of pybal in esams [07:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:23] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:40:26] (03PS3) 10Ema: cache_text: upgrade codfw to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/319351 (https://phabricator.wikimedia.org/T131503) [07:40:33] (03CR) 10Ema: [C: 032 V: 032] cache_text: upgrade codfw to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/319351 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [07:42:10] !log upgrading cp2023 (text-codfw) to varnish 4 -- T131503 [07:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:17] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [07:45:23] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA 
[P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:46:43] PROBLEM - Varnishkafka log producer on cp2023 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [07:46:49] that's me ^ [07:47:43] RECOVERY - Varnishkafka log producer on cp2023 is OK: PROCS OK: 3 processes with command name varnishkafka [07:48:54] * elukey thanks ema [07:49:13] WOW Varnish 4 in text! [07:49:18] \o/ [07:49:53] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:50:13] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:51:14] !log stopping mysql on db1042 [07:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:55] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2767459 (10jcrespo) [07:54:21] !log repool cp2019 varnish-be, currently depooled for no valid reason [07:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:13] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [07:56:46] !log restarting replication codfw -> eqiad on s1 [07:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:33] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:01:21] !log upgrading cp2019 (text-codfw) to varnish 4 -- T131503 [08:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:27] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [08:05:23] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:10:23] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:15:19] (03PS1) 10Madhuvishy: labstore: Setup secondary backups of tools and misc on labstore2003/4 [puppet] - 10https://gerrit.wikimedia.org/r/319530 (https://phabricator.wikimedia.org/T149870) [08:15:23] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:20:25] (03PS2) 10Madhuvishy: labstore: Setup secondary backups of tools and misc on labstore2003/4 [puppet] - 10https://gerrit.wikimedia.org/r/319530 (https://phabricator.wikimedia.org/T149870) [08:20:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:22:50] (03CR) 10Madhuvishy: [C: 032] labstore: Setup secondary backups of tools and misc on labstore2003/4 [puppet] - 10https://gerrit.wikimedia.org/r/319530 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [08:23:42] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3040 is CRITICAL: connect to address 10.20.0.175 and port 3128: Connection refused [08:24:08] looking ^ [08:25:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:25:42] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3040 is OK: HTTP OK: 
HTTP/1.1 200 OK - 188 bytes in 0.242 second response time [08:25:47] <_joe_> damnn payments server [08:27:41] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2767568 (10Aklapper) >>! In T149609#2766021, @Zppix wrote: > @hashar As you may already know we are actually doing not that we ar... [08:30:22] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:32:06] <_joe_> volans: does this open a ticket automatically too? ^^ [08:32:21] (03PS2) 10Jcrespo: mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) [08:34:30] (03PS1) 10Madhuvishy: labstore: Remove device mounts from secondary backup servers [puppet] - 10https://gerrit.wikimedia.org/r/319531 [08:34:56] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2767569 (10Joe) This is not limited to a discovery system, and is focused on MediaWiki (as it's... [08:35:32] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:37:30] (03CR) 10Madhuvishy: [C: 032] labstore: Remove device mounts from secondary backup servers [puppet] - 10https://gerrit.wikimedia.org/r/319531 (owner: 10Madhuvishy) [08:38:32] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [08:38:52] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [08:39:22] 06Operations, 10Traffic: varnish-be not restarting correctly because of disk space issues - https://phabricator.wikimedia.org/T149881#2767575 (10ema) [08:40:22] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:41:15] (03PS1) 10Muehlenhoff: Fix read_ahead handling, drop chapoly preference patch [debs/openssl11] - 10https://gerrit.wikimedia.org/r/319532 (https://phabricator.wikimedia.org/T144626) [08:42:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix read_ahead handling, drop chapoly preference patch [debs/openssl11] - 10https://gerrit.wikimedia.org/r/319532 (https://phabricator.wikimedia.org/T144626) (owner: 10Muehlenhoff) [08:45:12] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:46:56] _joe_ there is https://phabricator.wikimedia.org/T149646 for payments2002, but can't find auto-magic from volans [08:47:51] <_joe_> elukey: ok can you schedule downtime for that service [08:47:53] <_joe_> ? 
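A sketch of what _joe_ is asking for here: scheduling fixed Icinga downtime for the flapping payments2002 RAID check by writing a SCHEDULE_SVC_DOWNTIME external command. The command-file path, duration and comment are assumptions; in practice a wrapper on the Icinga host handles this.

```python
#!/usr/bin/env python3
"""Schedule fixed downtime for one service via Icinga's command pipe."""
import time

CMD_FILE = '/var/lib/icinga/rw/icinga.cmd'  # assumed external command pipe
HOST, SERVICE = 'payments2002', 'check_raid'

start = int(time.time())
end = start + 24 * 3600  # 24h, until the failed disk is replaced

cmd = ('[{now}] SCHEDULE_SVC_DOWNTIME;{host};{svc};{start};{end};'
       '1;0;0;elukey;T149646 failed disk').format(
           now=start, host=HOST, svc=SERVICE, start=start, end=end)

with open(CMD_FILE, 'w') as pipe:
    pipe.write(cmd + '\n')
```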
[08:47:57] !log upgrading cp2007 (text-codfw) to varnish 4 -- T131503 [08:47:58] <_joe_> referring the phab ticket [08:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:03] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [08:48:28] sure! [08:48:50] (03CR) 10Muehlenhoff: "Looks fine, but 316032 needs to be merged first, otherwise ferm will fail to resolve the AAAA entry." [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [08:50:22] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [08:50:52] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:51:31] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [08:51:42] !log gallium: unmounted /var/lib/jenkins/tmpfs freeing 512MBytes. Artifact from the past freeing up 512MBytes of memory [08:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:25] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2767607 (10Naveenpf) >>! In T144508#2764038, @CRoslof wrote: >>>! In T144508#2740901, @Naveenpf wrote: >> @Aklapper Can you please change title t... [08:59:46] (03PS3) 10Jcrespo: mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) [08:59:56] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [09:00:25] !log contint1001: preliminary transfer of jenkins history from gallium using rsync [09:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:42] (03PS4) 10Jcrespo: mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) [09:01:54] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [09:03:39] [WBr9SgpAMEgAAGSOaYwAAABD] 2016-11-03 09:03:06: Fatal exception of type "TimestampException" [09:05:34] (03PS5) 10Jcrespo: mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) [09:07:24] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2767615 (10Marostegui) I have completed the megacli documentation: https://wikitech.wikimedia.org/wiki/MegaCli This is the diff: https://wikitech.wikimedia.org/w/index.php?title=MegaCli&type=revision&dif... 
[09:09:17] (03CR) 10Marostegui: [C: 031] mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [09:14:36] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2750453 (10Marostegui) This is now in good state and can probably be closed: ``` Current Status: OK (for 0d 14h 21m 14s) Status Information: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:... [09:16:16] 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2767624 (10Marostegui) I would upgrade the BIOS if we need to reboot the server anyways. [09:16:42] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2767625 (10Marostegui) 05Open>03Resolved [09:16:51] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1003.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [09:16:55] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb1004.eqiad.wmnet (tags: ['dc=eqiad', 'cluster=scb', 'service=apertium']) [09:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium']) [09:17:16] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: scb2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=scb', 'service=apertium']) [09:17:17] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2767627 (10Marostegui) 05Open>03Resolved [09:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:06] (03CR) 10Mobrovac: [C: 031] Use ordered_yaml function to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [09:23:50] (03CR) 10Jcrespo: [C: 04-1] "This doesn't work." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [09:24:07] !log upgrading cp2016 (text-codfw) to varnish 4 -- T131503 [09:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:13] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [09:25:19] 06Operations, 10Citoid, 10VisualEditor, 06Services (blocked): Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#2767646 (10Mvolz) So apparently we have this now? https://wikitech.wikimedia.org/wiki/Nova_Resource:Citoid.services.eqiad.wmflabs... [09:26:27] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 seconds ago with 2 failures. 
Failed resources (up to 3 shown) [09:27:27] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [09:28:16] 07Puppet, 06Labs, 10wikitech.wikimedia.org: Puppet failure e-mail from labs/wikitech contains wrong url to wikitech - https://phabricator.wikimedia.org/T149883#2767649 (10Mvolz) [09:29:02] hashar: o/ i was out yesterday afternoon, otherwise i would've watched/helped with the afternoon deploy :D [09:29:33] (03PS1) 10Jcrespo: Decommission/make spare db1042 [puppet] - 10https://gerrit.wikimedia.org/r/319533 (https://phabricator.wikimedia.org/T149793) [09:32:36] (03CR) 10Marostegui: [C: 031] "Looks good, as a side comment, just in case (I am sure you are aware) the server still appears on the mediawiki php config files (db-eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/319533 (https://phabricator.wikimedia.org/T149793) (owner: 10Jcrespo) [09:34:15] (03CR) 10Jcrespo: "But it does not? https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php" [puppet] - 10https://gerrit.wikimedia.org/r/319533 (https://phabricator.wikimedia.org/T149793) (owner: 10Jcrespo) [09:35:11] (03CR) 10Jcrespo: [C: 032] Decommission/make spare db1042 [puppet] - 10https://gerrit.wikimedia.org/r/319533 (https://phabricator.wikimedia.org/T149793) (owner: 10Jcrespo) [09:35:36] (03CR) 10Marostegui: "My bad, did git fetch but not git pull. You are right, it doesn't appear there" [puppet] - 10https://gerrit.wikimedia.org/r/319533 (https://phabricator.wikimedia.org/T149793) (owner: 10Jcrespo) [09:36:40] !log Deploy schema change s5 dewiki.revision - only codfw https://phabricator.wikimedia.org/T148967 [09:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:56] !log uploaded openssl 1.1.0b-1+wmf2 for jessie-wikimedia to apt.wikimedia.org (adding the read_ahead bugfix and dropping the chapoly_pref patch) [09:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:13] 06Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission db1042 - https://phabricator.wikimedia.org/T149793#2767673 (10jcrespo) This is ready to go. [09:39:22] 06Operations, 10Citoid, 10VisualEditor, 06Services (blocked): Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#2767678 (10mobrovac) Nope, @Mvolz that's not it. I will likely delete that instance as it's not really used. [09:40:24] 06Operations, 10ops-eqiad, 10DBA, 10hardware-requests: db1019: Decommission - https://phabricator.wikimedia.org/T146265#2767680 (10jcrespo) This is ready to go. [09:40:39] phuedx: no worries. There will be some other opportunity :] [09:40:41] 06Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670#2767684 (10Marostegui) 05Open>03Resolved This is now good: ``` root@db2047:~# hpssacli ctrl all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E0DB0) Gen8 ServBP 12+2 at Por... [09:42:26] (03CR) 10Elukey: "PCC just to be sure: https://puppet-compiler.wmflabs.org/4528/" [puppet] - 10https://gerrit.wikimedia.org/r/319278 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [09:42:27] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2767686 (10mobrovac) >>! In T149331#2767405, @KartikMistry wrote: > ``` > npm test > ``` > > for cxserver is failing for me. Debugging further. 
Remember that after you swi... [09:43:18] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2767687 (10jcrespo) I am assuming we want to decom all of those, based on these are the old ones that replaced the newly purchased ones. I have marked {T149793} {T146265} are read... [09:47:46] (03CR) 10Muehlenhoff: "I don't see why we'd need role::spare here, it's meant for outgoing servers? We can add them to site.pp as they're being put into service " [puppet] - 10https://gerrit.wikimedia.org/r/319278 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [09:48:35] 06Operations, 10DBA, 13Patch-For-Review, 05Prometheus-metrics-monitoring: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#2767698 (10jcrespo) [09:48:39] 06Operations, 10DBA, 05Prometheus-metrics-monitoring: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#2767695 (10jcrespo) 05Open>03stalled Half of this went to public prometheus. We cannot hold there query data as it can contain PII. The solutio... [09:48:43] 06Operations, 10DBA, 10Traffic, 06WMF-Legal, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#2767699 (10jcrespo) [09:50:16] (03CR) 10Alexandros Kosiaris: [C: 031] "That's definitely better than both the old approach and the one proposed earlier. /me likes" [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [09:50:33] (03CR) 10Muehlenhoff: "Or to make them initially available in general, just add them with "include standard" as currently done with e.g. cp3022" [puppet] - 10https://gerrit.wikimedia.org/r/319278 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [09:53:02] 06Operations, 06Discovery, 06Maps: Investigate Swift as a storage backend for maps tiles - https://phabricator.wikimedia.org/T149885#2767720 (10Gehel) [09:53:57] (03Abandoned) 10Elukey: Add new mc* servers to site.pp with role:spare [puppet] - 10https://gerrit.wikimedia.org/r/319278 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [09:54:45] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 1633 MB (3% inode=94%) [09:55:05] contint1001 is me [09:55:09] 06Operations, 10DBA, 13Patch-For-Review, 05Prometheus-metrics-monitoring: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#2767736 (10jcrespo) Pending ones: ``` $ sudo salt --output=txt -C 'G@mysql_group:core' cmd.run 'mysql --defaults-file=/root/.my.cnf --batch... [09:55:24] bah [10:00:42] 06Operations, 10Graphite, 10Monitoring, 10Wikimedia-General-or-Unknown: Easy way to define alerts for ganglia data - https://phabricator.wikimedia.org/T59882#2767744 (10Aklapper) >>! In T59882#958530, @chasemp wrote: > sure -- I think that's the plan but @fgiunchedi could provide more details, but I have n... [10:01:16] _joe_, elukey checking (re: raid on payments2002) [10:03:15] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:06:45] RECOVERY - Disk space on contint1001 is OK: DISK OK [10:17:58] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T149728#2767773 (10Volans) [10:20:25] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:32:19] (03PS1) 10Addshore: Add wmgRevisionSliderBetaFeature (default true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319538 [10:32:21] (03PS1) 10Addshore: Enable RevisionSlider (non beta feature) on beta sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319539 (https://phabricator.wikimedia.org/T149725) [10:32:22] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:32:23] (03PS1) 10Addshore: Enable RevisionSlider (non BF) on test & mediawiki wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319540 (https://phabricator.wikimedia.org/T149724) [10:32:25] (03PS1) 10Addshore: Enable RevisionSlider (non BetaFeature) on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319541 (https://phabricator.wikimedia.org/T148646) [10:34:52] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2767815 (10Gilles) >>! In T66214#2762226, @bearND wrote: > In addition to that I'd like to know what the relationship between this task and the Thumbor... [10:39:24] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2767842 (10mobrovac) >>! In T149408#2763391, @Anomie wrote: > There are several ways MediaWiki provides for running jobs:... [10:48:22] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [10:51:00] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2767871 (10Paladox) @hashar ok it now can automatically restart the ssh connection if it detects that it was dropped. I.E. if ger... [10:55:00] !log disable puppet throughout the fleet. merging https://gerrit.wikimedia.org/r/#/c/316032/1 [10:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:54] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2767875 (10Florian) [11:01:26] (03CR) 10Volans: "Random comments inline, mostly style-related" (0314 comments) [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [11:02:43] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2767879 (10Gilles) In the current examples, I think it's unfortunate that height-constraining isn't considered. Not as a feature that would be availabl... 
[11:02:54] mysql traffic is spiky [11:03:37] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=slave&from=1478100495868&to=1478170930946 [11:05:09] (03CR) 10Alexandros Kosiaris: [C: 032] add IPv6 AAAA and PTR for all puppetmasters [dns] - 10https://gerrit.wikimedia.org/r/316032 (owner: 10Dzahn) [11:07:22] (03PS2) 10Alexandros Kosiaris: add IPv6 AAAA and PTR for all puppetmasters [dns] - 10https://gerrit.wikimedia.org/r/316032 (owner: 10Dzahn) [11:07:47] (03CR) 10Alexandros Kosiaris: [C: 032] add IPv6 AAAA and PTR for all puppetmasters [dns] - 10https://gerrit.wikimedia.org/r/316032 (owner: 10Dzahn) [11:08:10] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2767895 (10Florian) Update: There was a new user reporting such a problem (from the US coast guard, and he was reportin... [11:10:20] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#2767897 (10Gilles) FYI I've sort of implement a solution for this on Vagrant a while ago for the current thumbnail URI scheme, by replaci... [11:12:16] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#2767898 (10Gilles) The option that makes that happen is $supportsSha1URLs on the FileRepo [11:14:28] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2767900 (10BBlack) Keep in mind we've had the GlobalSign incident ~ 2 weeks ago, which precipitated us switching Global... [11:15:42] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2767902 (10BBlack) [As an aside, we'd **love** to have more-direct contact on this from someone technical inside these... 
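(On the certificate reports above: a quick way to inspect the chain a client is actually served. The hostname is illustrative and this assumes a stock OpenSSL client; it is a sketch, not the diagnostic the team used.)

```
# Print issuer/subject/validity of the leaf certificate presented by the
# server; the issuer line shows which CA chain is currently in use.
openssl s_client -connect en.wikipedia.org:443 -servername en.wikipedia.org \
    </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject -dates
```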
[11:15:49] 06Operations, 06Discovery, 06Maps: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2767903 (10Gehel) [11:16:37] (03PS1) 10Alexandros Kosiaris: puppetmaster: Update ferm rules for IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/319544 [11:16:40] !log depooling cp2016, cp2007, cp2019, cp2023: not caching properly (T131503) [11:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:47] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [11:16:54] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Update ferm rules for IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/319544 (owner: 10Alexandros Kosiaris) [11:29:17] (03PS1) 10Jcrespo: labswiki: Add wikitech access from mira to mirror tin privileges [puppet] - 10https://gerrit.wikimedia.org/r/319546 (https://phabricator.wikimedia.org/T149186) [11:31:01] (03CR) 10Jcrespo: [C: 032] labswiki: Add wikitech access from mira to mirror tin privileges [puppet] - 10https://gerrit.wikimedia.org/r/319546 (https://phabricator.wikimedia.org/T149186) (owner: 10Jcrespo) [11:32:32] !log restbase deploy start of 1ec3b129 [11:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:30] (03PS1) 10Alexandros Kosiaris: Fix ferm typo introduced in 6ffcac5717 (parentheses) [puppet] - 10https://gerrit.wikimedia.org/r/319547 [11:37:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix ferm typo introduced in 6ffcac5717 (parentheses) [puppet] - 10https://gerrit.wikimedia.org/r/319547 (owner: 10Alexandros Kosiaris) [11:38:30] When's the next mw wmf ver deployed? I would look but I'm in the middle of writing a big ass chunk of code [11:42:44] (03PS1) 10Elukey: First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) [11:43:02] (03PS1) 10Giuseppe Lavagetto: docker::registry: fix htpasswd location [puppet] - 10https://gerrit.wikimedia.org/r/319549 [11:45:26] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::registry: fix htpasswd location [puppet] - 10https://gerrit.wikimedia.org/r/319549 (owner: 10Giuseppe Lavagetto) [11:45:45] (03PS2) 10Giuseppe Lavagetto: docker::registry: fix htpasswd location [puppet] [11:45:48] (03CR) 10Giuseppe Lavagetto: [V: 032] docker::registry: fix htpasswd location [puppet] - 10https://gerrit.wikimedia.org/r/319549 (owner: 10Giuseppe Lavagetto) [11:48:03] !log reenable puppet across the fleet on hosts that I had disabled it. https://gerrit.wikimedia.org/r/#/c/316032/1 merged successfully [11:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:22] 06Operations, 10DBA: db1065 paged for NRPE timeout - https://phabricator.wikimedia.org/T149633#2767959 (10jcrespo) 05Open>03Resolved a:03jcrespo There has been several api issues in the last weeks. While the report is certainly helpful, I am resolving this because the long-term fixes are identified and... [11:49:23] (03CR) 10Alexandros Kosiaris: [C: 031] "https://gerrit.wikimedia.org/r/#/c/316032, https://gerrit.wikimedia.org/r/#/c/319544/ and https://gerrit.wikimedia.org/r/#/c/319547/ have " [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [11:50:26] akosiaris: did you already salt re-enable? [11:50:35] nevermind, I see it happened now [11:52:09] bblack: are you needing any assistance that i may be able to help with?
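(For the puppetmaster IPv6 records merged above, a minimal sanity check; the hostname is taken from the log and this assumes the zone change has propagated:)

```
# Forward AAAA lookup, then a reverse lookup of the returned address;
# the names should round-trip if both the AAAA and PTR records exist.
dig +short AAAA puppetmaster1001.eqiad.wmnet
dig +short -x "$(dig +short AAAA puppetmaster1001.eqiad.wmnet)"
```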
[11:52:12] !log restbase deploy end of 1ec3b129 [11:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:41] bblack: just finished [11:53:04] * akosiaris sees puppetmaster connections over IPv6 :-) [11:53:16] akosiaris: great work! [11:53:30] (03PS6) 10Jcrespo: mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) [11:53:58] Now can you do that to my connection lol akosiaris [11:54:16] provide you with IPv6 ? [11:54:25] !log change-prop deploying 15eae87 [11:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:48] Zppix|mobile: https://www.sixxs.net/main/ ? [11:55:15] akosiaris: ah my isp doesn't support ipv6 for my package [11:55:15] if your provider does not give you IPv6 and you absolutely want it, tunnelling via sixxs is a nice alternative [11:55:16] (03PS1) 10Mobrovac: RESTBase: Remove Wikidata from the list of domains [puppet] - 10https://gerrit.wikimedia.org/r/319552 (https://phabricator.wikimedia.org/T149114) [11:55:35] And i hate setting up networking for myself [12:02:06] (03CR) 10Jcrespo: [C: 031] "This looks good now: https://puppet-compiler.wmflabs.org/4529/" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [12:02:22] 06Operations, 10DBA, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2767991 (10Dereckson) a:03Dereckson Thanks. Next step is Apache and DNS. If all is ready, I'll create the wiki Monday 2016-11-07. [12:03:00] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: depend on service, not docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/319342 [12:04:13] (03CR) 10Marostegui: [C: 031] mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [12:05:54] (03CR) 10Alexandros Kosiaris: [C: 031] RESTBase: Remove Wikidata from the list of domains [puppet] - 10https://gerrit.wikimedia.org/r/319552 (https://phabricator.wikimedia.org/T149114) (owner: 10Mobrovac) [12:06:05] mobrovac: merging ^ [12:06:11] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Remove Wikidata from the list of domains [puppet] - 10https://gerrit.wikimedia.org/r/319552 (https://phabricator.wikimedia.org/T149114) (owner: 10Mobrovac) [12:06:28] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::baseimages: depend on service, not docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/319342 (owner: 10Giuseppe Lavagetto) [12:06:33] (03CR) 10Jcrespo: [C: 032] mariadb: Add the posibility of selecting other package versions [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [12:07:58] (03PS4) 10Alexandros Kosiaris: tcpircbot: improve firewall rule setup [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [12:08:06] (03PS1) 10Jcrespo: mariadb: Add the posibility of selecting other package versions [puppet] - 10https://gerrit.wikimedia.org/r/319555 (https://phabricator.wikimedia.org/T149422) [12:08:30] Is jenkins stuck again or does it ignore the puppet repo [12:08:36] (03PS3) 10Giuseppe Lavagetto: docker::baseimages: depend on service, not docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/319342 [12:08:38] (03PS2) 10Jcrespo: mariadb: Add the posibility of
selecting other package versions [puppet] - 10https://gerrit.wikimedia.org/r/319555 (https://phabricator.wikimedia.org/T149422) [12:08:42] !log Deploying schema change s4 commonswiki.revision only codfw - https://phabricator.wikimedia.org/T147305 [12:08:43] (03CR) 10Giuseppe Lavagetto: [V: 032] docker::baseimages: depend on service, not docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/319342 (owner: 10Giuseppe Lavagetto) [12:09:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] tcpircbot: improve firewall rule setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [12:09:06] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: do not perform hiera lookups within the module [puppet] - 10https://gerrit.wikimedia.org/r/319343 [12:09:37] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] docker::baseimages: do not perform hiera lookups within the module [puppet] - 10https://gerrit.wikimedia.org/r/319343 (owner: 10Giuseppe Lavagetto) [12:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:14] (03PS3) 10Jcrespo: mariadb: Add the posibility of selecting other package versions [puppet] - 10https://gerrit.wikimedia.org/r/319555 (https://phabricator.wikimedia.org/T149422) [12:14:54] (03CR) 10Jcrespo: [C: 031] Revert "Revert "tendril: Supply a robots.txt disallow all robots"" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [12:15:23] (03CR) 10Alexandros Kosiaris: [C: 032] "Thanks. Merging :-)" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [12:15:28] (03PS2) 10Alexandros Kosiaris: Revert "Revert "tendril: Supply a robots.txt disallow all robots"" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) [12:15:30] (03CR) 10Alexandros Kosiaris: [V: 032] Revert "Revert "tendril: Supply a robots.txt disallow all robots"" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [12:15:44] (03CR) 10Jcrespo: [C: 032] mariadb: Add the posibility of selecting other package versions [puppet] - 10https://gerrit.wikimedia.org/r/319555 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [12:15:50] (03PS4) 10Jcrespo: mariadb: Add the posibility of selecting other package versions [puppet] - 10https://gerrit.wikimedia.org/r/319555 (https://phabricator.wikimedia.org/T149422) [12:25:26] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:26:37] (03PS1) 10Muehlenhoff: Update date in debian/changelog (so that it properly shows up in uname) [debs/linux44] - 10https://gerrit.wikimedia.org/r/319556 [12:26:56] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:27:39] (03CR) 10Muehlenhoff: [C: 032] Update date in debian/changelog (so that it properly shows up in uname) [debs/linux44] - 10https://gerrit.wikimedia.org/r/319556 (owner: 10Muehlenhoff) [12:30:22] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2768048 (10Gilles) Just like the previous time, it seems to be IM creating this temp file and going crazy. I agree that ideally we would cap the disk usage and thumbor w... 
[12:30:52] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2768049 (10Gilles) Ah, turns out that IM has a number of limit mechanisms: http://www.imagemagick.org/script/command-line-options.php This is probably worth investigati... [12:32:35] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2768054 (10Gilles) The defaults are terrible: unlimited time and unlimited disk... [12:34:36] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:35:06] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:35:06] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:36:26] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:36:26] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:38:26] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:39:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Initial commit (031 comment) [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [12:40:11] (03PS1) 10Jcrespo: Install MariaDB 10.1 on New labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/319558 (https://phabricator.wikimedia.org/T149422) [12:40:51] (03PS2) 10Jcrespo: Add $managed flag to mariadb::service [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318859 (owner: 10Andrew Bogott) [12:41:46] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:42:56] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:43:46] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:46:56] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:47:16] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:49:56] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:51:54] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:52:04] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T1300). [13:00:28] Nothing to deploy :] [13:00:29] nothing in SWAT yet, but I would like to add 1! 
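(For reference on the ImageMagick limit mechanisms discussed above: the `-limit` flags and MAGICK_* variables are ImageMagick's own, but the values here are purely illustrative, not what Thumbor ended up adopting.)

```
# Per-invocation resource caps; as noted above, the disk and time
# limits otherwise default to unlimited.
convert input.tif -limit memory 256MiB -limit disk 1GiB -limit time 60 \
    -thumbnail 400x output.jpg

# The same caps can be applied through the environment:
MAGICK_DISK_LIMIT=1GiB MAGICK_TIME_LIMIT=60 convert input.tif -thumbnail 400x output.jpg
```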
[13:00:36] go go go !:] [13:00:52] hahaaa, would you mind doing it for me hashar ? https://gerrit.wikimedia.org/r/#/c/319538/ [13:01:22] I'll add it to the calendar [13:01:27] should be a noop [13:01:43] (03PS1) 10Mobrovac: RESTBase: Remove www.wikidata.org as well [puppet] - 10https://gerrit.wikimedia.org/r/319560 (https://phabricator.wikimedia.org/T149114) [13:01:48] akosiaris: ^^^ [13:01:59] akosiaris: the former patch removed only test.wd.org :/ [13:01:59] (03CR) 10Hashar: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319538 (owner: 10Addshore) [13:02:47] So many thanks hashar ! [13:02:57] well that enables some $wg feature switch though :D [13:02:58] https://gerrit.wikimedia.org/r/#/c/319538/1/wmf-config/CommonSettings.php [13:03:11] hashar: it defaults to true currently ;) [13:03:20] (03CR) 10Hashar: [C: 032] Add wmgRevisionSliderBetaFeature (default true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319538 (owner: 10Addshore) [13:03:25] but over the next 4 weeks I will slowly be setting it to false for a bunch of places! [13:03:32] ohh [13:03:46] figured we may as well get the addition of the var in a different / not busy swat! [13:03:52] (03Merged) 10jenkins-bot: Add wmgRevisionSliderBetaFeature (default true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319538 (owner: 10Addshore) [13:04:36] !log re-enabling cr2-esams:xe-0/1/3 + cr2-eqiad:xe-4/1/3 (esams-eqiad link) [13:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:43] addshore: syncing [13:05:11] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Add wmgRevisionSliderBetaFeature (default true) (duration: 00m 47s) [13:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:08] !log hashar@tin Synchronized wmf-config/CommonSettings.php: Add wmgRevisionSliderBetaFeature (default true) (duration: 00m 46s) [13:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:16] addshore: done :] [13:06:23] great thanks! [13:06:31] and the beta config files get synced to beta automatically right? [13:06:47] yeah [13:06:54] cool, and swat is over!:D [13:06:57] the job that pushes to beta should report back to the change soonish [13:07:09] (03PS1) 10BBlack: Text VCL: Fix cookie handling for Varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/319561 (https://phabricator.wikimedia.org/T131503) [13:09:44] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:17:05] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2768207 (10Aklapper) (Please consider proofreading comments before adding them. I have no idea what "Soniasuing" is.) [13:18:16] Hi, what about enabling $wgAbuseFilterProfile at cswiki? Will it be a problem with community consensus? [13:19:30] This will be only for about 14 days... Or indef if it won't be a problem.
[13:20:24] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2768213 (10Paladox) Oh sorry [13:20:24] PROBLEM - Apache HTTP on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.149 second response time [13:20:24] PROBLEM - HHVM rendering on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.152 second response time [13:21:30] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 72150 bytes in 0.300 second response time [13:21:30] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.115 second response time [13:24:48] (03PS1) 10Urbanecm: Enable wgAbuseFilterProfile at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319566 (https://phabricator.wikimedia.org/T149899) [13:29:48] (03CR) 10Ema: [C: 031] Text VCL: Fix cookie handling for Varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/319561 (https://phabricator.wikimedia.org/T131503) (owner: 10BBlack) [13:29:56] (03PS2) 10Elukey: First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) [13:31:46] !log change-prop deploying a1bd739 [13:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:49] (03PS1) 10Urbanecm: New user right and user group for et.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319568 (https://phabricator.wikimedia.org/T149610) [13:35:59] !log cp1065: upgrade libssl1.1 to 1.1.0b-1+wmf2 - T144626 - T148917 [13:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:06] T144626: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626 [13:36:06] T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917 [13:37:25] (03PS2) 10Mobrovac: RESTBase: Use the LVS Realserver role [puppet] - 10https://gerrit.wikimedia.org/r/316954 [13:38:08] (03CR) 10Mobrovac: "@_joe_ {{done}}" [puppet] - 10https://gerrit.wikimedia.org/r/316954 (owner: 10Mobrovac) [13:42:16] !log cp*: upgrade libssl1.1 to 1.1.0b-1+wmf2 (but no nginx restart yet) - T144626 - T148917 [13:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:24] T144626: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626 [13:42:24] T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917 [13:48:26] (03Draft2) 10MarcoAurelio: Enable $wgAbuseFilterProfile at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319569 (https://phabricator.wikimedia.org/T149901) [13:49:25] !log cache_maps + cache_misc: nginx lossless restarts for libssl update - T144626 - T148917 [13:49:30] PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.146 second response time [13:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:32] T144626: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626 [13:49:33] T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - 
https://phabricator.wikimedia.org/T148917 [13:50:30] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.098 second response time [13:51:10] /usr/local/bin/restart-hhvm did --^ [13:51:42] it is not quick enough to avoid alarms [13:56:14] !log cache_upload: nginx lossless restarts for libssl update - T144626 - T148917 [13:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:22] T144626: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626 [13:56:22] T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917 [13:56:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] RESTBase: Remove www.wikidata.org as well [puppet] - 10https://gerrit.wikimedia.org/r/319560 (https://phabricator.wikimedia.org/T149114) (owner: 10Mobrovac) [13:57:14] mobrovac: ^ merged [13:57:50] akosiaris: thnx! [13:59:21] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4009 is CRITICAL: connect to address 10.128.0.109 and port 3128: Connection refused [14:00:05] !log restbase rolling restart for T149114 [14:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:13] T149114: Reconsider wikidata support in the REST API - https://phabricator.wikimedia.org/T149114 [14:00:51] (03PS3) 10Jcrespo: Add $managed flag to mariadb::service [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318859 (owner: 10Andrew Bogott) [14:01:21] RECOVERY - Varnish HTTP text-backend - port 3128 on cp4009 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.148 second response time [14:02:00] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2768356 (10Gilles) Can't get the limits to work via wand. Systemd seems to have the options we need: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#... [14:02:35] (03CR) 10Jcrespo: [C: 031] Add $managed flag to mariadb::service [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318859 (owner: 10Andrew Bogott) [14:03:38] (03CR) 10Marostegui: [C: 031] "This means replication will be started too, right?" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318859 (owner: 10Andrew Bogott) [14:04:27] (03PS8) 10Ottomata: Use ordered_yaml function to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) [14:05:35] (03PS4) 10Jcrespo: Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/318572 (owner: 10Andrew Bogott) [14:06:44] (03CR) 10Jcrespo: [C: 031] "I've asked around, and people seem ok with not using base::service_unit for now and using the common mariadb::service." 
[puppet] - 10https://gerrit.wikimedia.org/r/318572 (owner: 10Andrew Bogott) [14:06:59] (03PS2) 10Giuseppe Lavagetto: role::builder: add docker support [puppet] - 10https://gerrit.wikimedia.org/r/319344 (https://phabricator.wikimedia.org/T149812) [14:08:13] !log cache_text: nginx lossless restarts for libssl update - T144626 - T148917 [14:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:22] T144626: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626 [14:08:22] T148917: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917 [14:10:11] PROBLEM - check_raid on lutetium is CRITICAL: CRITICAL: MegaSAS 1 logical, 2 physical: a0/v0 (2 disk array) degraded [14:13:29] (03PS1) 10Jcrespo: analytics-meta: Manage mariadb service through mariadb::service class [puppet] - 10https://gerrit.wikimedia.org/r/319570 [14:14:38] !log temporarily stop exim on fermium for mailman update [14:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:11] (03CR) 10jenkins-bot: [V: 04-1] analytics-meta: Manage mariadb service through mariadb::service class [puppet] - 10https://gerrit.wikimedia.org/r/319570 (owner: 10Jcrespo) [14:15:11] PROBLEM - check_raid on lutetium is CRITICAL: CRITICAL: MegaSAS 1 logical, 2 physical: a0/v0 (2 disk array) degraded [14:17:30] !log exim reenabled on fermium after mailman update [14:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:03] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2768398 (10Gilles) Actually I think that the systemd limit is undesirable as the limit would be over the lifetime of the Thumbor process. I.e. it would cause Thumbor pro... [14:18:43] (03PS1) 10Jcrespo: Beta: auto-start mysql on beta so it comes back after reboot [puppet] - 10https://gerrit.wikimedia.org/r/319572 [14:19:21] PROBLEM - HHVM rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.146 second response time [14:19:25] 06Operations, 06Performance-Team, 10Thumbor: Make Thumbor IM engine based on a subprocess - https://phabricator.wikimedia.org/T149903#2768408 (10Gilles) [14:19:50] (03PS2) 10Jcrespo: analytics-meta: Manage mariadb service through mariadb::service class [puppet] - 10https://gerrit.wikimedia.org/r/319570 [14:20:11] PROBLEM - check_raid on lutetium is CRITICAL: CRITICAL: MegaSAS 1 logical, 2 physical: a0/v0 (2 disk array) degraded [14:20:21] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 72151 bytes in 2.717 second response time [14:21:12] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: lutetium RAID disk failed - https://phabricator.wikimedia.org/T149904#2768426 (10Jgreen) [14:21:46] (03PS2) 10Jcrespo: Beta: auto-start mysql on beta so it comes back after reboot [puppet] - 10https://gerrit.wikimedia.org/r/319572 [14:21:48] ACKNOWLEDGEMENT - check_raid on lutetium is CRITICAL: CRITICAL: MegaSAS 1 logical, 2 physical: a0/v0 (2 disk array) degraded Jeff_Green ticketed: T149904 [14:25:58] (03CR) 10Jcrespo: "Replication automatic restart is not controlled here, but on configuration. 
This is thought for beta, labs-dns and analytics-meta, and tho" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318859 (owner: 10Andrew Bogott) [14:28:53] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3060099 keys, up 3 days 6 hours - replication_delay is 626 [14:30:53] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3043277 keys, up 3 days 6 hours - replication_delay is 16 [14:33:54] 06Operations, 06Performance-Team, 10Thumbor, 15User-Joe: Thumbor instances exit with exit code 0 even when crashing/failing - https://phabricator.wikimedia.org/T149560#2756079 (10MoritzMuehlenhoff) Yeah, that firejail bug was fixed in an earlier version. [14:33:58] 06Operations, 10Traffic, 06WMF-Communications, 07HTTPS, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2768462 (10Florian) >>! In T128182#2767902, @BBlack wrote: > [As an aside, we'd **love** to have more-direct contact on... [14:34:33] labvirt10* is still in puppetfail state for ~2h now [14:34:56] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: lutetium RAID disk failed - https://phabricator.wikimedia.org/T149904#2768465 (10Jgreen) p:05Triage>03High [14:37:59] (03PS3) 10Giuseppe Lavagetto: role::builder: add docker support [puppet] - 10https://gerrit.wikimedia.org/r/319344 (https://phabricator.wikimedia.org/T149812) [14:40:38] (03PS2) 10BBlack: ssl_ciphersuite: commentary update re: chapoly [puppet] - 10https://gerrit.wikimedia.org/r/316890 (https://phabricator.wikimedia.org/T144626) [14:41:19] (03PS1) 10Alexandros Kosiaris: Revert "icinga: switch tegmen and einsteinium roles" [puppet] - 10https://gerrit.wikimedia.org/r/319578 [14:41:36] (03CR) 10BBlack: [C: 032 V: 032] ssl_ciphersuite: commentary update re: chapoly [puppet] - 10https://gerrit.wikimedia.org/r/316890 (https://phabricator.wikimedia.org/T144626) (owner: 10BBlack) [14:45:05] (03PS1) 10Alexandros Kosiaris: Revert "switch over einsteinium to tegmen" [dns] - 10https://gerrit.wikimedia.org/r/319579 [14:46:07] (03PS4) 10Giuseppe Lavagetto: role::builder: add docker support [puppet] - 10https://gerrit.wikimedia.org/r/319344 (https://phabricator.wikimedia.org/T149812) [14:48:33] PROBLEM - Memcached on mc2001 is CRITICAL: connect to address 10.192.0.34 and port 11211: Connection refused [14:49:45] * elukey sees moritzm playing with the memcached patch [14:51:31] RECOVERY - Memcached on mc2001 is OK: TCP OK - 0.010 second response time on 10.192.0.34 port 11211 [14:52:17] (03PS2) 10Alexandros Kosiaris: Revert "icinga: switch tegmen and einsteinium roles" [puppet] - 10https://gerrit.wikimedia.org/r/319578 [14:52:20] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "icinga: switch tegmen and einsteinium roles" [puppet] - 10https://gerrit.wikimedia.org/r/319578 (owner: 10Alexandros Kosiaris) [14:52:50] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2768489 (10jcrespo) [14:53:10] 06Operations, 10Traffic, 13Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2768492 (10BBlack) Update: We're now preferring chapoly to other symmetric algorithms outright in our strongest cipher suites at the top of the list, without prefe... 
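(One way to observe the server-side ordering described in that update; the endpoint and cipher list are illustrative, and this assumes an OpenSSL client built with ChaCha20-Poly1305 support:)

```
# Offer chapoly and AES-GCM suites together; with server-side preference
# the terminator's choice reveals which it ranks higher.
openssl s_client -connect text-lb.eqiad.wikimedia.org:443 \
    -servername en.wikipedia.org \
    -cipher 'ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384' \
    </dev/null 2>/dev/null | grep -m1 'Cipher'
```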
[14:55:51] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2760850 (10jcrespo) https://phabricator.wikimedia.org/T149728#2768489 See kernel.log: {P4360} Both the thermal issues happening and the I/O errors, causing data corruption: ``` Nov 3 12:55:29... [14:56:16] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:16] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:26] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:26] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:26] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:36] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:36] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:36] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:36] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:42] ignore these ^ [14:57:01] !log failover icinga from tegmen to einsteinium [14:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:32] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "switch over einsteinium to tegmen" [dns] - 10https://gerrit.wikimedia.org/r/319579 (owner: 10Alexandros Kosiaris) [14:57:46] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:56] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:58:04] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2760850 (10Marostegui) The original report raid report showed this information for disk 32:4 ``` Raw Size: 558.911 GB [0x45dd2fb0 Sectors] Firmware state: =====> Failed <===== Media Type:... [14:58:26] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:58:36] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:58:56] PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [14:59:06] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:59:30] PROBLEM - MariaDB Slave Lag: s1 on db1073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 32728.22 seconds [14:59:30] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:00:06] PROBLEM - check_raid on lutetium is CRITICAL: CRITICAL: MegaSAS 1 logical, 2 physical: a0/v0 (2 disk array) degraded [15:00:16] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [15:00:47] We are aware of db1073 and discussing it on #wikimedia-databases (looks like HW problems) - T149728 [15:00:58] T149728: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728 [15:01:18] marostegui: yeah, I think it paged just because of the migration from tegmen to einsteinium [15:01:42] Yes, because I checked and it is downtimed [15:01:49] Thanks for the heads up :) [15:02:00] who removed the db1073 alert disabling? [15:02:08] not me [15:02:12] * marostegui steps back [15:02:40] whoever it was, it created a page for ongoing issues [15:03:07] jynus: I've failed icinga back from tegmen to einsteinium as I said [15:03:12] it's probably an aftermath [15:03:23] funnily enough I did transfer the damn state as well [15:03:33] so, it should not have paged ... [15:03:37] ok, so it was not on purpose, no problem [15:05:06] PROBLEM - check_raid on lutetium is CRITICAL: CRITICAL: MegaSAS 1 logical, 2 physical: a0/v0 (2 disk array) degraded [15:05:16] PROBLEM - check_raid on payments2002 is CRITICAL: CRITICAL: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 Interim Recovery Mode, phy_2I:1:2: Failed] [15:06:49] marostegui, check db2034 monitoring config, when you can [15:07:28] (03CR) 10Hashar: [C: 031] zuul::merger: switch gearman server to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318252 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [15:07:32] hmm why on earth was the state not transferred [15:07:52] "Error: Could not open command file '/var/lib/nagios/rw/nagios.cmd' for update!" [15:08:00] there are service issues, akosiaris [15:08:04] that is probably it [15:08:21] madhuvishy: hello if you are ready let me know [15:08:55] papaul: Hi! yes I'm around, ready to go :) [15:09:13] ok are we good to power off labstore2001? [15:09:28] papaul: yup [15:09:33] jynus: ok I think I transferred the state as well [15:09:37] this time around at elast [15:09:38] least* [15:09:45] do we have anything important running now on labstore2002,2003 or 2004? [15:09:58] lemme look at the permissions problem as well [15:10:17] now it is ok, akosiaris (I assume that was the original server) [15:10:18] papaul: 2002 and 2003, no.
2004 I reimaged and set up yesterday to be a backup of 2001 [15:10:40] but we can kill the backup process and we can shut it down if needed [15:11:02] PROBLEM - Host lvs1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:02] PROBLEM - Host lvs1009 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:02] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:09] ah finally [15:11:11] there we go [15:11:16] madhuvishy: no i just need one server, maybe 2002; i need to pull a disk from there and put it in labstore2001 [15:11:19] ok the state was not well transferred, my fault [15:11:35] papaul: yeah 2002 isn't being used [15:11:38] madhuvishy: if i get the disk order for 2001 i can replace it [15:11:47] madhuvishy: cool [15:11:49] papaul: yeah alright [15:12:06] (03PS1) 10Hashar: contint: enable zuul::server on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/319584 [15:12:20] mdholloway: shutting down 2001 now [15:12:42] ACKNOWLEDGEMENT - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris https://phabricator.wikimedia.org/T112781 [15:12:42] ACKNOWLEDGEMENT - Host lvs1008 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris https://phabricator.wikimedia.org/T112781 [15:12:43] ACKNOWLEDGEMENT - Host lvs1009 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris https://phabricator.wikimedia.org/T112781 [15:13:06] papaul: 👍 [15:14:25] ACKNOWLEDGEMENT - Check rp_filter disabled on lvs1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. alexandros kosiaris https://phabricator.wikimedia.org/T104458 [15:14:25] ACKNOWLEDGEMENT - DPKG on lvs1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. alexandros kosiaris https://phabricator.wikimedia.org/T104458 [15:14:25] ACKNOWLEDGEMENT - Disk space on lvs1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. alexandros kosiaris https://phabricator.wikimedia.org/T104458 [15:14:25] ACKNOWLEDGEMENT - HP RAID on lvs1007 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. alexandros kosiaris https://phabricator.wikimedia.org/T104458 [15:14:25] ACKNOWLEDGEMENT - NTP on lvs1007 is CRITICAL: NTP CRITICAL: No response from NTP server alexandros kosiaris https://phabricator.wikimedia.org/T104458 [15:16:12] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 72150 bytes in 0.196 second response time [15:16:16] madhuvishy: shutting down 2002 [15:16:27] papaul: okay [15:18:37] madhuvishy: disk replacement complete; powering 2001 back up to check the controller, this will take a minute [15:19:20] papaul: ok [15:19:23] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2768544 (10jcrespo) ```lines=10 Severity Date and Time Message ID Summary Comment 2016-11-03T15:15:04-0500 USR0030 Successfully logged in using root, from 10.64.48.28 and GUI. 2016... [15:21:07] madhuvishy: so we're setting 2001 to HW raid1 with all 12 disks, correct? [15:22:12] papaul: so there are only 12 disks? [15:22:43] for the H700 controller [15:22:53] ah [15:23:59] and the H800 controller? [15:24:14] not there yet [15:25:13] madhuvishy: i wanted to make sure first that the H700 can see all the 12 disks [15:25:32] papaul: yes can confirm raid1 for all 12 disks [15:25:36] the H800 will have the external storage with 60 disks [15:26:26] papaul: okay, and all these 12 disks connected currently to H700 are doing okay?
[15:26:28] !log scb in eqiad disabling puppet [15:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:46] (you just replaced one of course) [15:26:51] madhuvishy: i can see all the 12 disks on H700 [15:27:01] papaul: okay cool [15:27:24] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2768559 (10jcrespo) So the plan is: #1 install new disk #2 correct cpu thermal issues #3 redo the RAID from 0 because of this "uncorrectable error" Ok with that plan @Cmjohnson ? This is NOT an... [15:27:37] (03PS3) 10Rush: bigbrother: Rewrite as python script [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: 10BryanDavis) [15:27:39] (03CR) 10Ottomata: [C: 032] Use ordered_yaml function to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [15:27:43] (03PS9) 10Ottomata: Use ordered_yaml function to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) [15:27:47] (03CR) 10Ottomata: [V: 032] Use ordered_yaml function to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [15:28:13] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2768564 (10Cmjohnson) @jcrespo works for me. [15:28:40] 06Operations, 06Labs, 10netops, 05Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2768566 (10fgiunchedi) [15:29:51] (03CR) 10Alexandros Kosiaris: "general premise looks fine, I am still struggling with a few bits here and there but that's me still reading a bit on our docker classes. " [puppet] - 10https://gerrit.wikimedia.org/r/319344 (https://phabricator.wikimedia.org/T149812) (owner: 10Giuseppe Lavagetto) [15:29:55] (03PS1) 10Faidon Liambotis: docker: use mirrors.wm.org, not ubuntu.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/319588 [15:31:18] (03CR) 10Faidon Liambotis: [C: 032] docker: use mirrors.wm.org, not ubuntu.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/319588 (owner: 10Faidon Liambotis) [15:32:07] madhuvishy: ok i can see the H800 controller but for now we have no disks connected to it since i disconnected all the external storage arrays [15:32:57] papaul: see as in installed, or is it also being recognized by labstore2001? [15:34:22] madhuvishy: didn't get the question [15:36:07] papaul: when chasemp ran sudo megacli -AdpAllInfo -aAll | grep -i PERC, he mentioned on the ticket it only listed Product Name : PERC H700 Integrate. I am asking if now, both H700 and H800 show up in controller info, or is it just physically installed but not being recognized still [15:36:49] madhuvishy: that was within the OS [15:37:32] physically i can access the H800 manager interface [15:38:07] papaul: ah right okay [15:38:31] madhuvishy: the first step was to confirm that we can see both controllers [15:39:20] papaul: agreed.
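(A minimal sketch of that confirmation, building on the megacli invocation quoted above; the adapter numbering is an assumption:)

```
# Both PERCs should now be listed where previously only the H700 appeared.
megacli -AdpAllInfo -aAll | grep -i 'Product Name'
# Count the physical drives each adapter sees (a0 assumed H700, a1 H800).
megacli -PDList -a0 | grep -c 'Slot Number'
megacli -PDList -a1 | grep -c 'Slot Number'
```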
[15:39:51] !log scb in eqiad enabled puppet back [15:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:42] papaul: okay i think at this point I should reimage the box, and then we can move on to test the status of the external storage arrays [15:41:12] !log installing memcached security updates on graphite hosts [15:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:51] (03CR) 10Dzahn: [C: 032] add missing asset tag forward DNS [dns] - 10https://gerrit.wikimedia.org/r/319512 (https://phabricator.wikimedia.org/T149875) (owner: 10Papaul) [15:42:08] (03PS3) 10Dzahn: add missing asset tag forward DNS [dns] - 10https://gerrit.wikimedia.org/r/319512 (https://phabricator.wikimedia.org/T149875) (owner: 10Papaul) [15:42:29] madhuvishy: do we know what partman recipe we are going to use for the system? [15:43:29] papaul: yeah we can use labstore-lvm-noraid [15:44:12] PROBLEM - HHVM rendering on mw1232 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [15:44:51] madhuvishy: cool can you put that in netboot.cfg ? [15:45:12] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 72170 bytes in 0.093 second response time [15:46:07] 06Operations, 13Patch-For-Review: mgmt hosts that exist but don't resolve to an IP - https://phabricator.wikimedia.org/T149875#2767378 (10Dzahn) other changes related to this ticket: https://gerrit.wikimedia.org/r/319396 https://gerrit.wikimedia.org/r/319495 https://gerrit.wikimedia.org/r/319490 https://gerri... [15:46:26] 06Operations, 13Patch-For-Review: mgmt hosts that exist but don't resolve to an IP - https://phabricator.wikimedia.org/T149875#2768630 (10Dzahn) a:03Papaul [15:46:47] papaul: yeah, doing right now [15:47:17] CI/Jenkins is going down in a few minutes for a scheduled maintenance [15:48:22] okay! [15:48:32] madhuvishy: thanks [15:49:09] hashar: i'm here [15:49:35] madhuvishy: just to confirm before the install: the H800 is seeing 24 disks right now. i just wanted to make sure before we start the install, and i will disconnect that before the install [15:50:01] 06Operations, 10Ops-Access-Requests: Access to fluorine for viewing logs (wm-log-reader) - https://phabricator.wikimedia.org/T149832#2768635 (10MoritzMuehlenhoff) [15:50:05] (03PS1) 10Madhuvishy: labstore: Configure labstore-lvm-raid partman recipe for labstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/319591 (https://phabricator.wikimedia.org/T102626) [15:50:09] 06Operations, 10ops-eqiad, 10DBA: db1051 disk is about to fail - https://phabricator.wikimedia.org/T149908#2768636 (10jcrespo) ``` Enclosure Device ID: 32 Slot Number: 8 Drive's position: DiskGroup: 0, Span: 4, Arm: 0 Enclosure position: 1 Device Id: 8 WWN: 5000C5005ABB3B30 Sequence Number: 2 Media Error Cou... [15:50:46] (03PS5) 10Giuseppe Lavagetto: role::builder: add docker support [puppet] - 10https://gerrit.wikimedia.org/r/319344 (https://phabricator.wikimedia.org/T149812) [15:50:59] papaul: https://gerrit.wikimedia.org/r/#/c/319591 [15:51:43] <_joe_> CI broken again?
[15:51:54] _joe_: down for maint iirc [15:52:17] <_joe_> https://cdn.meme.am/instances/56608824.jpg [15:52:21] migration of gallium to contint1001 [15:52:24] papaul: yup sounds good [15:52:32] (03CR) 10Giuseppe Lavagetto: [C: 032] role::builder: add docker support [puppet] - 10https://gerrit.wikimedia.org/r/319344 (https://phabricator.wikimedia.org/T149812) (owner: 10Giuseppe Lavagetto) [15:52:36] starts in 8 minutes [15:52:41] <_joe_> mutante: oh ok [15:52:49] <_joe_> I thought it was yesterday, heh [15:52:56] (03PS1) 10Madhuvishy: labstore: Remove nfs backup role from labstore1001-2 [puppet] - 10https://gerrit.wikimedia.org/r/319592 [15:53:15] heh, yes, that was my mistake [15:53:29] it's today, time to merge quick :) [15:53:40] madhuvishy: ^ :) [15:53:49] 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2768649 (10MoritzMuehlenhoff) a:03Papaul [15:54:55] (03PS2) 10Madhuvishy: labstore: Configure labstore-lvm-raid partman recipe for labstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/319591 (https://phabricator.wikimedia.org/T102626) [15:55:01] (03PS1) 10Jcrespo: mariadb: Reduce db1051 load, it has hardware issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319593 [15:55:37] chasemp: were you going to say I should remove it? :) [15:55:42] ^I am about to deploy mediawiki-config [15:55:48] no just merge quick madhuvishy heh [15:55:52] (03CR) 10Madhuvishy: [C: 032] labstore: Remove nfs backup role from labstore1001-2 [puppet] - 10https://gerrit.wikimedia.org/r/319592 (owner: 10Madhuvishy) [15:56:03] (03PS2) 10Madhuvishy: labstore: Remove nfs backup role from labstore1001-2 [puppet] - 10https://gerrit.wikimedia.org/r/319592 [15:56:06] (03CR) 10Madhuvishy: [V: 032] labstore: Remove nfs backup role from labstore1001-2 [puppet] - 10https://gerrit.wikimedia.org/r/319592 (owner: 10Madhuvishy) [15:56:33] (03CR) 10Jcrespo: [C: 032] mariadb: Reduce db1051 load, it has hardware issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319593 (owner: 10Jcrespo) [15:56:51] (03PS3) 10Gehel: maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) [15:56:54] 06Operations, 10Graphite, 10Monitoring, 10Wikimedia-General-or-Unknown: Easy way to define alerts for ganglia data - https://phabricator.wikimedia.org/T59882#2768652 (10fgiunchedi) 05stalled>03Invalid @Aklapper no, we're deprecating ganglia so this is invalid now [15:57:12] (03PS3) 10Madhuvishy: labstore: Configure labstore-lvm-raid partman recipe for labstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/319591 (https://phabricator.wikimedia.org/T102626) [15:57:16] mutante: I have synced all data between jenkins and contint1001 earlier today [15:57:44] great! [15:57:46] (03CR) 10Madhuvishy: [C: 032 V: 032] labstore: Configure labstore-lvm-raid partman recipe for labstore2001 [puppet] - 10https://gerrit.wikimedia.org/r/319591 (https://phabricator.wikimedia.org/T102626) (owner: 10Madhuvishy) [15:57:50] madhuvishy: give me a minute going back on H700 to create the RAID 1 [15:57:57] 06Operations, 10ops-eqiad, 10DBA: db1051 disk is about to fail - https://phabricator.wikimedia.org/T149908#2768655 (10MoritzMuehlenhoff) a:03Cmjohnson [15:58:11] papaul: ah okay [15:58:14] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 26 seconds ago with 2 failures.
Failed resources (up to 3 shown): Package[docker-engine],File[/var/lib/docker/devicemapper] [15:58:52] so I guess we just have to stop puppet/jenkins/zuul on gallium, last minute rsync to catch up then bring up services on contint1001 [15:59:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: mariadb: Reduce db1051 load, it has hardware issues (duration: 00m 47s) [15:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T1600). Please do the needful. [16:00:05] hashar, thcipriani, and mutante: Respected human, time to deploy CI Migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T1600). Please do the needful. [16:00:07] o/ [16:00:09] https://docs.google.com/document/d/1xOcXkQA9gJaLAeyA6pePUJPZmV62RFU3KapGg8LCJ_A/edit# [16:00:11] hashar: should i start with "contint: enable zuul::server on contint1001" ? [16:00:14] na [16:00:21] lets stop services first [16:00:23] ok [16:00:41] stopping puppet/jenkins/zuul on gallium [16:01:07] just pull the cable outta the plug [16:01:09] :P [16:02:00] bunch of rsync needed now [16:02:09] mutante: you can land the patch [16:02:16] will bring up zuul on contint1001 (hopefully) [16:02:22] ok [16:02:31] (03PS2) 10Dzahn: contint: enable zuul::server on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/319584 (owner: 10Hashar) [16:02:34] but maybe we want to stop it [16:02:55] I found a nice oddity earlier this morning [16:03:07] (03PS3) 10Hashar: contint: enable zuul::server on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/319584 (https://phabricator.wikimedia.org/T95757) [16:03:10] which is that the path where Jenkins saves the build history is different between hosts :( [16:03:12] (03CR) 10Gehel: [C: 032] maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [16:03:14] (03PS4) 10Gehel: maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) [16:03:31] rsyncing the Jenkins config files ( rsync --delete --info=progress2 -az --exclude=builds/* rsync://gallium.wikimedia.org/jenkins /var/lib/jenkins [16:03:31] ) [16:03:34] PROBLEM - jenkins_zmq_publisher on gallium is CRITICAL: Connection refused [16:03:40] (03CR) 10Dzahn: [C: 032 V: 032] contint: enable zuul::server on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/319584 (https://phabricator.wikimedia.org/T95757) (owner: 10Hashar) [16:03:54] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [16:04:02] hashar: landed on master [16:04:04] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused [16:04:05] neat [16:04:07] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gallium [16:04:12] guess all can be acked now [16:04:14] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [16:04:16] including the uppet last run [16:04:35] i'm doing icinga. 
you run puppet on contint1001 [16:04:41] (03PS1) 10Giuseppe Lavagetto: role::builder: configure docker repository [puppet] - 10https://gerrit.wikimedia.org/r/319594 [16:04:50] thcipriani: do you have root on contint1001 ? [16:05:39] scheduled downtime for all gallium services for a couple days [16:05:42] (puppet ran, zuul is not started by puppet :] ) [16:05:52] (03PS2) 10Giuseppe Lavagetto: role::builder: configure docker repository [puppet] - 10https://gerrit.wikimedia.org/r/319594 [16:05:53] 06Operations, 06Discovery, 06Maps: Investigate Swift as a storage backend for maps tiles - https://phabricator.wikimedia.org/T149885#2767720 (10fgiunchedi) What sorts of traffic and object size/numbers are we talking about for swift? swift in codfw could be used for tests for this, it has the same specs as... [16:05:58] hashar: nope, but I can run puppet-run [16:06:10] and I can start zuul [16:06:18] that is a start :] [16:06:36] er, no, just zuul-merger, which is not on this machine :( [16:06:39] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::builder: configure docker repository [puppet] - 10https://gerrit.wikimedia.org/r/319594 (owner: 10Giuseppe Lavagetto) [16:06:40] ACKNOWLEDGEMENT - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn T95757 [16:06:40] ACKNOWLEDGEMENT - jenkins_zmq_publisher on gallium is CRITICAL: Connection refused daniel_zahn T95757 [16:06:40] ACKNOWLEDGEMENT - zuul_gearman_service on gallium is CRITICAL: Connection refused daniel_zahn T95757 [16:06:41] ACKNOWLEDGEMENT - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server daniel_zahn T95757 [16:07:03] we have a patch to switch the zuul:merger [16:07:25] zuul::merger: switch gearman server to contint1001 [16:07:29] you want that merged next? [16:07:43] yeah [16:07:53] (03PS2) 10Dzahn: zuul::merger: switch gearman server to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318252 (https://phabricator.wikimedia.org/T95757) [16:08:01] rsyncing the build history ( rsync --delete --info=progress2 -az rsync://gallium.wikimedia.org/jenkins/builds /srv/jenkins/builds ) [16:08:14] (03CR) 10Dzahn: [C: 032 V: 032] zuul::merger: switch gearman server to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318252 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [16:08:15] and we can land the patch for nodepool [16:08:21] on labnodepool1001.eqiad.wmnet [16:08:33] I cant remember off hand how nodepool reach out to jenkins, probably via misc varnish [16:09:06] (03CR) 10Hashar: [C: 031] nodepool: point to Jenkins on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/313599 (https://phabricator.wikimedia.org/T95757) (owner: 10Hashar) [16:09:18] cache::misc: switch gallium to contint1001 [16:09:27] hashar: ^ that? [16:09:32] thcipriani: wanna baby sit nodepool ? :] [16:09:35] mutante: yes [16:09:43] and https://gerrit.wikimedia.org/r/#/c/313599/ for nodepool [16:09:53] yes. I can. 
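(The migration copy quoted above is effectively two phases: the Jenkins config tree first with the bulky per-job build records excluded, then the build history itself into its own filesystem. Restated from the commands in the log, assuming gallium keeps exporting the same rsync modules:)
    # phase 1: config and job definitions, skipping build records for now
    rsync --delete --info=progress2 -az --exclude='builds/*' \
        rsync://gallium.wikimedia.org/jenkins /var/lib/jenkins
    # phase 2: the build history, onto the /srv filesystem
    rsync --delete --info=progress2 -az \
        rsync://gallium.wikimedia.org/jenkins/builds /srv/jenkins/builds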
[16:10:00] guess we can stop nodepool now [16:10:03] papaul: let me know when you're done with setting up RAID1 :) [16:10:03] (03PS3) 10Dzahn: nodepool: point to Jenkins on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/313599 (https://phabricator.wikimedia.org/T95757) (owner: 10Hashar) [16:10:14] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:10:29] (03CR) 10Dzahn: [C: 032 V: 032] nodepool: point to Jenkins on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/313599 (https://phabricator.wikimedia.org/T95757) (owner: 10Hashar) [16:10:36] nodepool is stopped [16:10:52] and rsync is still going on [16:10:53] nodepool: point to Jenkins on contint1001 - landed [16:11:11] (03PS2) 10Dzahn: cache::misc: switch gallium to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318246 (https://phabricator.wikimedia.org/T95757) [16:11:14] rsyncing the docroots ( rsync --delete --info=progress2 -az rsync://gallium.wikimedia.org/docroot /srv/org [16:11:14] ) [16:11:33] (03PS1) 10Ema: varnish-backend-restart: workaround fallocate issues [puppet] - 10https://gerrit.wikimedia.org/r/319596 (https://phabricator.wikimedia.org/T149881) [16:11:34] hashar: what about the "announcement" change :) [16:11:55] https://integration.wikimedia.org/zuul/ CI will be unavailable for maintenance on Thursday 3rd Nov from 16:00 UTC to 18:00 UTC :D [16:12:04] 06Operations, 10Traffic, 13Patch-For-Review: varnish-be not restarting correctly because of disk space issues - https://phabricator.wikimedia.org/T149881#2768711 (10ema) p:05Triage>03Normal [16:12:04] :) [16:12:19] I'll run puppet on labnodepool (and probably re-stop nodepool after) [16:12:43] hashar: want me to switch doc. and integration. to new backend in varnish now? [16:12:44] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:12:45] thcipriani: and check the host/ip in /etc/nodepool/nodepool.yaml [16:12:49] yarp [16:12:53] mutante: yeah I guess we can [16:12:58] mutante: I got the docroot updated [16:13:14] PROBLEM - HHVM rendering on mw1205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [16:13:18] the build history is still syncing. Such a pity the disk cache got wiped out somehow; it takes ages [16:13:26] madhuvishy: ok booting now to start the install with the 12 disks on H700 [16:13:45] !log mw1205 - service hhvm restart [16:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:00] papaul: okay, just to confirm, H800 is connected, but the external storage shelves are not attached?
[16:14:00] (03CR) 10Dzahn: [C: 032] cache::misc: switch gallium to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318246 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [16:14:11] madhuvishy: correct [16:14:14] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 72165 bytes in 0.432 second response time [16:14:19] papaul: 👍 [16:14:22] (03CR) 10Dzahn: [V: 032] cache::misc: switch gallium to contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/318246 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [16:14:49] CI config updated https://gerrit.wikimedia.org/r/293300 gallium is replaced by contint1001.eqiad.wmnet [16:15:11] (03PS2) 10Dzahn: contint: rm gallium from ferm rules in zuul::merger [puppet] - 10https://gerrit.wikimedia.org/r/318247 (https://phabricator.wikimedia.org/T95757) [16:15:24] that one next ^ since we switched [16:15:56] I would keep the gallium setup around though [16:15:57] puppet seems to be running on labnodepool1001, sure is taking a while. [16:15:59] in case we need to rollback [16:16:04] (03CR) 10Dzahn: [C: 032 V: 032] contint: rm gallium from ferm rules in zuul::merger [puppet] - 10https://gerrit.wikimedia.org/r/318247 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [16:16:09] !log OS install on labstore2001 [16:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:38] and nodepool points to Jenkins public service hostname: 'https://integration.wikimedia.org/ci/' [16:16:47] (03PS1) 10Reedy: Elevate password policies for all users on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319598 (https://phabricator.wikimedia.org/T149638) [16:16:50] so the switch in varnish misc is definitely needed [16:16:58] rsync of builds is complete ! [16:17:02] hashar: switch in varnish misc is merged [16:17:12] hashar: "switch zuul CNAME from gallium to contint1001" now? [16:17:18] DNS [16:17:26] yeah for zuul.eqiad.wmnet [16:17:28] still used by nodepool [16:17:50] ok, also here's an overview link https://gerrit.wikimedia.org/r/#/q/topic:gallium-migrate [16:18:05] (03PS2) 10Dzahn: switch zuul CNAME from gallium to contint1001 [dns] - 10https://gerrit.wikimedia.org/r/318249 (https://phabricator.wikimedia.org/T95757) [16:18:08] neat! [16:18:27] all rsync compltes [16:18:29] btw is this nodepool stuff fully ppe-tested, or we doing haywire mayhem in prod? :D [16:19:28] ppe-tested / haywire : I cant parse that :D [16:19:31] there is a migration plan [16:19:43] (03PS1) 10Reedy: Move and simplify some wikitech specific config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319600 [16:19:55] the new machine is identical to the old one -hopefully- [16:20:02] then we just swap some IP addresses here and there [16:20:07] (03CR) 10Dzahn: [C: 032 V: 032] switch zuul CNAME from gallium to contint1001 [dns] - 10https://gerrit.wikimedia.org/r/318249 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [16:20:20] so I am still in "Switch" state [16:20:22] Change Jenkins build dir. 
/var/lib/jenkins/config.xml [16:20:24] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:20:24] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:20:24] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:20:44] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:20:44] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:20:44] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:20:44] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:20:45] zuul.eqiad.wmnet switched in DNS .. now [16:20:45] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:20:52] labnodepool /etc/nodepool/nodepool.yaml is updated; stopping nodepool again now. [16:20:54] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:20:54] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:20:57] !log stopping and upgrading labsdb1009,10,11 (also disabling temporarily puppet) [16:20:59] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2768740 (10Ottomata) [16:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:04] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:21:04] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:21:14] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:21:14] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:21:28] 06Operations, 06Analytics-Kanban, 10hardware-requests: stat1001 replacement box in eqiad - https://phabricator.wikimedia.org/T149911#2768755 (10Ottomata) [16:21:33] I guess we can start Jenkins now [16:21:34] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:21:54] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:22:32] labnodepool1001.eqiad.wmnet still resolves zuul.eqiad.wmnet to gallium [16:22:34] (03PS1) 10Reedy: Load OATHAuth on wikitech same as other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319601 [16:22:36] hashar - preproduction tested, not live prod testing [16:22:38] might want to flush dns cache there? [16:23:06] hashar: we should have lowered the TTL before i guess [16:23:12] I probably don't have permission for that. 
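(A quick way to confirm what a host's resolver is still serving, and for how long: query the local recursor, then an authoritative server directly, and compare. A sketch; the nameserver here is illustrative, and the TTL column on the cached answer counts down while the stale record sits in cache:)
    # cached answer from the local pdns-recursor
    dig +noall +answer zuul.eqiad.wmnet
    # authoritative answer, bypassing the cache, for comparison
    dig +noall +answer zuul.eqiad.wmnet @ns0.wikimedia.org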
[16:23:34] (03PS1) 10Reedy: Remove commented OpenID config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319602 [16:23:38] i have one more that isnt just "decom gallium" for later: [16:23:41] or we can drop that service entry [16:23:42] contint: remove gallium conditional from contint::master_dir [16:24:05] and have nodepool point directly to contint1001.wikimedia.org instead of zuul.eqiad.wmnet [16:24:34] for gearman? I can make a patch. [16:24:49] lets do that [16:25:01] I think that the last use for zuul.eqiad.wmnet entry [16:25:07] (03PS2) 10Jcrespo: Install MariaDB 10.1 on New labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/319558 (https://phabricator.wikimedia.org/T149422) [16:25:44] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:25:55] hmm.. rec_control: command not found [16:25:57] (03CR) 10Jcrespo: [C: 032 V: 032] Install MariaDB 10.1 on New labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/319558 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo) [16:26:02] that would be for flushing from cache [16:26:10] bah [16:26:59] bblack: do i use something else that replaced "rec_control" to clear a record from cache? [16:27:19] (03PS1) 10Thcipriani: Use contint1001.wikimedia.org as gearman host [puppet] - 10https://gerrit.wikimedia.org/r/319604 [16:27:51] ^ remove zuul.eqiad.wmnet from nodepool config yaml [16:28:14] does that need ferm changes? [16:28:21] trying to spawn jenkins [16:28:30] (03PS2) 10Dzahn: Use contint1001.wikimedia.org as gearman host [puppet] - 10https://gerrit.wikimedia.org/r/319604 (owner: 10Thcipriani) [16:28:39] ah jenkins is masked :] [16:28:45] # systemctl -p LoadState show jenkins.service [16:28:45] LoadState=masked [16:28:51] (03CR) 10Dzahn: [C: 032 V: 032] Use contint1001.wikimedia.org as gearman host [puppet] - 10https://gerrit.wikimedia.org/r/319604 (owner: 10Thcipriani) [16:29:45] hashar: ah, yes. .systemctl unmask jenkins.service [16:29:53] yeah I remember you did that intentioanlly [16:29:54] 06Operations, 10netops: Investigate why disabling an uplink port did not deprioritize VRRP on cr2-eqiad - https://phabricator.wikimedia.org/T119759#2768782 (10mark) It's actually working as designed. Our current configuration looks like: track { interface ae4.1004 {... [16:30:01] PROBLEM - zuul_gearman_service on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused [16:30:02] to make sure we did not get jenkins to spawn magically [16:30:11] PROBLEM - zuul_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [16:30:19] hrm. nc from labnodepool to contint1001 for gearman is refused on 4730. Does it need ferm changes? [16:30:36] i'll leave those icinga messages un-ACKed, nice to see them recover [16:30:41] RECOVERY - jenkins_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [16:30:49] well we would ses recovery either way [16:30:57] nc -vz contint1001.wikimedia.org -w1 4730 -> contint1001.wikimedia.org [208.80.154.17] 4730 (?) 
: Connection refused [16:31:10] yes, i expected that too [16:31:35] modules/contint/manifests/firewall.pp: port => '4730', [16:31:41] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:31:58] (03PS1) 10Yurik: LABS: Enable Map (GeoJSON) data on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319605 (https://phabricator.wikimedia.org/T149548) [16:31:58] $zuul_merger_hosts_ferm = join($zuul_merger_hosts, ' ') [16:32:00] stopped jenkins again [16:32:04] manually disabling gearman client [16:32:14] $zuul_merger_hosts = hiera('contint::zuul_merger_hosts') [16:32:30] thcipriani: yeah zuul not up yet [16:32:37] papaul: any update? [16:32:39] ah, right :) [16:32:44] I would like to confirm that jenkins is up first [16:32:53] madhuvishy: rebooting to OS [16:33:04] common/contint.yaml:contint::zuul_merger_hosts: [16:33:04] labs/integration/common.yaml:contint::zuul_merger_hosts: [16:33:30] eh, no, that's scandium [16:33:35] https://integration.wikimedia.org/ci/ got me something [16:33:41] dangit puppet. stopping nodepool again. [16:34:09] papaul: cool! [16:34:58] mutante: did you get all misc varnishes updated ? [16:34:58] madhuvishy: you want me to run the first puppet run and salt or are you going to do it? [16:35:27] hashar: so far just let puppet run [16:35:43] papaul: it's now booted and needs cert signing etc, correct? I can do it [16:35:49] I got random 503 hitting jenkins urls [16:36:00] not sure whether it is varnish or the apache proxy in front of jenkins [16:36:08] madhuvishy: ok go ahead and let me know when you are done so we can go to step 2 [16:36:18] i am getting out of console [16:36:36] guess some entries are stale in cache [16:36:42] madhuvishy: i am out [16:37:21] hrm, I'm getting rando 503s as well: stuff like: https://integration.wikimedia.org/ci/static/da89bd23/images/32x32/blue.png [16:37:30] eg GET of https://integration.wikimedia.org/ci/computer/gallium/ [16:37:31] hashar: i used the wrong server to run rec_control, now: [16:37:32] wiped 0 records, 0 negative records [16:37:45] X-Cache: cp1051 pass, cp3008 pass, cp3010 pass [16:38:22] hashar: now checking the misc-web servers [16:38:41] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:38:42] I guess misc cache just passes everything down to the backend [16:38:47] so maybe apache on contint1001 has some info [16:38:51] eg in /var/log/apache2 [16:40:31] jenkins is much faster now loading :) [16:40:31] [hydrogen:~] $ sudo rec_control wipe-cache zuul.eqiad.wmnet [16:40:31] wiped 7 records, 0 negative records [16:40:31] hashar: ^ actual cache wipe [16:40:31] gone now [16:40:33] 503s seem to have subsided, seems like it was passing to gallium as a backend. [16:40:33] but if we use that, we have to revert the change to use contint1001 [16:40:37] mutante: magic!
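(For the record, rec_control talks to the local pdns-recursor over its control socket, which is why running it on the wrong box wiped 0 records: the wipe has to happen on each recursor. A sketch:)
    # on each recursor host (hydrogen and chromium, per this log):
    sudo rec_control wipe-cache zuul.eqiad.wmnet
    # then verify the next lookup returns the new target
    dig +short zuul.eqiad.wmnet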
[16:40:42] also that saves us from having to do the ferm change [16:40:47] (03PS1) 10Ema: cache_text: switch to file storage backend on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/319609 (https://phabricator.wikimedia.org/T142810) [16:40:48] and we keep using "internal" [16:40:54] I have setup contint1001 has a slave in Jenkins [16:41:07] guess we can start zuul && nodepool now [16:41:15] [chromium:~] $ sudo rec_control wipe-cache zuul.eqiad.wmnet [16:41:15] wiped 6 records, 0 negative records [16:41:18] marostegui: is es2019 in maintenance mode and off? [16:42:11] PROBLEM - Apache HTTP on mw1282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [16:42:11] PROBLEM - HHVM rendering on mw1282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.006 second response time [16:42:15] any order for those? Want to start zuul and then I'll fire up nodepool? [16:42:37] yup [16:42:38] Zuul Server: /etc/default/zuul is not set to START_DAEMON=1: exiting: failed! [16:42:42] ahh [16:42:53] ^^ i got that too [16:42:58] when i did that on a test instance [16:42:59] so something is not provisionned properly [16:43:01] wait, we have to revert one [16:43:11] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.033 second response time [16:43:11] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 72168 bytes in 0.112 second response time [16:43:11] (03PS1) 10Dzahn: Revert "Use contint1001.wikimedia.org as gearman host" [puppet] - 10https://gerrit.wikimedia.org/r/319610 [16:43:18] hashar: ^ correct? [16:43:28] since we keep using zuul.eqiad.wmnet now [16:43:39] mutante: lets stick to contint1001 [16:43:47] and get rid of zuul.eqiad.wmnet entirely [16:43:58] hashar: that requires a new ferm change [16:45:11] hashar: then why did the DNS cache wipe do anything ?:) [16:45:21] I don't understand why: contint firewall allows gearman from the nodepool host, right? [16:45:32] are you sure you want to make a last-minute change to the plan [16:45:34] (I don't have access to iptables to see any nuance, just basing off puppet) [16:45:43] maybe the ferm rule uses @resolve(gallium) ? [16:46:01] they should be ferm rules on contint1001, tho? [16:46:39] @contint1001:~# iptables -L | grep 4730 [16:46:39] ACCEPT tcp -- labnodepool1001.eqiad.wmnet anywhere tcp dpt:4730 [16:46:42] ACCEPT tcp -- scandium.eqiad.wmnet anywhere tcp dpt:4730 [16:46:45] that? [16:47:00] (wiped out /etc/default/zuul , puppet did not manage to update it) [16:47:25] that should be all we need, I think for nodepool to work with contint1001 with the "use contint1001 as gearman host" thing. 
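(The START_DAEMON failure above is the standard Debian /etc/default guard: the service script exits early unless its defaults file enables it. Since puppet manages that file here, a plausible recovery is to let puppet rewrite it and bounce the unit; a sketch:)
    # inspect the guard that made zuul-server exit
    grep START_DAEMON /etc/default/zuul
    # have puppet re-provision the wiped defaults file, then restart the unit
    sudo puppet agent --test
    sudo systemctl restart zuul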
[16:47:45] once gearman is running [16:50:01] RECOVERY - zuul_gearman_service on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 [16:50:01] so I got zuul to start [16:50:08] had to systemd reload [16:50:11] RECOVERY - zuul_service_running on contint1001 is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server [16:50:14] stop and start the systemctl unit [16:50:21] ah [16:50:24] hence nodepool should be able to reach it [16:50:30] kk, starting nodepool [16:50:59] nc worked, FYI on labnodepool [16:51:05] neat [16:51:22] (03Abandoned) 10Dzahn: Revert "Use contint1001.wikimedia.org as gearman host" [puppet] - 10https://gerrit.wikimedia.org/r/319610 (owner: 10Dzahn) [16:51:30] Jenkins web interface looks more or less okish https://integration.wikimedia.org/ci/ [16:51:41] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [16:52:41] hrm, it shows up as syn_sent in netstat for 4730 [16:53:14] now established [16:53:29] (03PS1) 10Hashar: zuul-gearman-status.py requires python-gear [puppet] - 10https://gerrit.wikimedia.org/r/319612 [16:53:33] tcp 0 0 10.64.20.18:52694 208.80.154.17:4730 ESTABLISHED - [16:53:36] :) [16:53:40] seems to be working [16:53:42] nice [16:53:47] (03CR) 10jenkins-bot: [V: 04-1] zuul-gearman-status.py requires python-gear [puppet] - 10https://gerrit.wikimedia.org/r/319612 (owner: 10Hashar) [16:54:06] on contint1001 , I wrote a thin client to query the gearman server: zuul-gearman.py status [16:54:19] that list all the functions registered [16:54:34] apparently zuul-merger on scandium is connected [16:55:01] connecting Jenkins to Zuul gearman [16:55:29] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/319612 (owner: 10Hashar) [16:55:36] hrm, nodepool list shows a lot of gallium targets [16:55:47] (03CR) 10Dzahn: [C: 04-1] "typo. the word "package" is misspelled" [puppet] - 10https://gerrit.wikimedia.org/r/319612 (owner: 10Hashar) [16:55:50] yeah that is all the nodes [16:55:55] that nodepool knows/maintains [16:56:03] most should be in "ready" state [16:56:06] yep [16:56:07] hashar is it expected that zuul website shows a website declined? [16:56:08] all are [16:56:11] https://integration.wikimedia.org/zuul/ [16:56:13] eg they spawned on openstack and got added to the target jenkins [16:56:14] paladox: yes [16:56:17] Ok [16:56:29] if one are still attached to "gallium" we can get rid of them: nodepool delete 1234 [16:56:36] where 1234 is the id of a node attached to gallium [16:56:41] ok, doing [16:57:56] (03PS2) 10Dzahn: zuul-gearman-status.py requires python-gear [puppet] - 10https://gerrit.wikimedia.org/r/319612 (owner: 10Hashar) [16:58:11] (03CR) 10jenkins-bot: [V: 04-1] zuul-gearman-status.py requires python-gear [puppet] - 10https://gerrit.wikimedia.org/r/319612 (owner: 10Hashar) [16:58:11] grrrit-wm1: why dont you mention my upload [16:58:14] eh...crap. Since gallium is no more listed in the targets in the nodepool.yaml it's throwing an exception :( [16:58:17] there we go [16:59:58] the Jenkins jobs are not all registered in gearman bha [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T1700). Please do the needful. 
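(The node-cleanup recipe from 16:56, in shell form; the id is illustrative:)
    nodepool list          # shows each node with its target jenkins master
    nodepool delete 1234   # drop a node still attached to gallium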
[17:00:04] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: deploy prometheus node_exporter for host monitoring - https://phabricator.wikimedia.org/T140646#2768878 (10akosiaris) [17:00:09] 06Operations, 06Labs, 10netops, 05Prometheus-metrics-monitoring: Firewall rules production/labs for prometheus-node-exporter - https://phabricator.wikimedia.org/T149253#2768875 (10akosiaris) 05Open>03Resolved a:03akosiaris After a couple of rounds, finally done. Tested with telnet on a few hosts and... [17:00:14] hashar: do you need that script now? [17:00:29] zuul-gearman-status [17:00:39] ORES has a deployment now if the phasing out of gallium is finished [17:00:52] Amir1: it's not finished [17:00:53] but it can wait [17:00:54] Great [17:00:58] On standby [17:01:00] restarting Jenkins [17:01:05] it did not read all the config files [17:01:10] mutante: thanks, tell us when it's done [17:02:23] Amir1: the scheduled window is one more hour [17:02:38] Nov 03, 2016 4:30:04 PM jenkins.InitReactorRunner$1 onTaskFailed [17:02:38] SEVERE: Failed Loading job operations-puppet-typos [17:02:39] :( [17:03:09] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2768895 (10faidon) The 2<->5 link was due to a faulty cable. Chris has replaced that and the stack is fully formed now, albeit with not much redundancy (still waiting... [17:04:19] there are a lot of jobs missing :\ [17:04:24] 06Operations, 06Discovery, 06Maps: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2767903 (10Yurik) The "req" metrics is being sent by [[ https://github.com/kartotherian/kartotherian-server/blob/master/lib/tiles.js#L66 | tiles.js ]]: ```kart... [17:05:18] yeah [17:05:19] :( [17:05:29] something is off with the build history [17:05:36] I am really tempted to just wipe it out entirely [17:05:42] is this because there are symlinks in the jobs to builds? [17:05:42] hey guys, is the CI stuff down? https://integration.wikimedia.org/ci/job/wikimedia-portals-npm-node-4-jessie/None/console [17:05:58] jan_drewniak: yup, scheduled maintenance for the next hour [17:05:59] jan_drewniak: planned maintenance [17:06:03] papaul: figuring out some salt stuff now, will report status in a minute [17:06:04] OH yeah [17:06:08] thcipriani: you are so right [17:06:14] the jobs have symlinks to /var/lib/jenkins/builds [17:06:16] so since the symlinks are broken it won't load jobs? [17:06:22] hashar, mutante - are the config scaps for labs-only files allowed? [17:06:32] it seems it overlaps with the services window: https://wikitech.wikimedia.org/wiki/Deployments [17:07:02] thcipriani: and it even wiped out the related build history [17:07:04] (03PS1) 10Ema: site: apply role::systemtap::devserver to copper [puppet] - 10https://gerrit.wikimedia.org/r/319616 [17:07:08] yurik: later.
we are migrating [17:07:17] ok [17:07:21] (03CR) 10jenkins-bot: [V: 04-1] site: apply role::systemtap::devserver to copper [puppet] - 10https://gerrit.wikimedia.org/r/319616 (owner: 10Ema) [17:07:22] thcipriani: I would just wipe all the symlinks / build number tracker and history [17:07:24] and start afresh [17:07:36] so /var/lib/jenkins/builds/**/lastSuccessfulBuild and /var/lib/jenkins/builds/**/lastStableBuild [17:07:36] else we can rsync again the whole build history [17:07:42] PROBLEM - jenkins_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [17:07:47] then make a one liner to update all symlinks [17:07:55] ah yeah [17:08:27] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2768914 (10ori) RLIMIT_FSIZE >>! In T145878#2768398, @Gilles wrote: > Actually I think that the systemd limit is undesirable as the limit would be over the lifetime of... [17:09:13] would rsyncing just the last successful and last stable be enough to make it happy? [17:09:34] I dont think they are needed [17:09:42] fair enough [17:09:42] RECOVERY - jenkins_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [17:09:58] give it a shot with beta-scap-eqiad: nobody'll miss that history [17:10:39] yeah [17:11:03] saw you removed :) [17:11:10] is it restarting? [17:11:48] madhuvishy: is it possible to get to the salt section later and login to the OS and see if you can see both controllers? [17:11:52] I have nuked the builds [17:11:55] starting again [17:12:02] okie doke [17:12:17] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2768927 (10RobH) @Cmjohnson: Are these R720xd's able to be retrofitted with SFF SSDs without issue? Please advise. If so, we'll need to order some Intel S3610 SSDs to pl... [17:12:39] it is better this time [17:13:19] (03CR) 10Dzahn: [C: 031] nodepool: bump nova client and openstack CLI [puppet] - 10https://gerrit.wikimedia.org/r/306220 (https://phabricator.wikimedia.org/T137217) (owner: 10Hashar) [17:13:21] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/319612 (owner: 10Hashar) [17:13:44] crap. nodepool refuses to cleanup since we removed the key gallium.wikimedia.org. There's no way for it to connect to that machine to drop nodes is there? [17:13:59] (03CR) 10Dzahn: [C: 031] contint: drop contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/317988 (owner: 10Hashar) [17:14:02] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888 [17:14:07] I guess it is a bug in nodepool [17:14:20] it tries to unpool the instances from the Jenkins master that no longer exists [17:14:22] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:14:28] can probably clean it up directly from the database [17:14:42] fun stuff.
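(On the "one liner to update all symlinks" idea floated above, before the history was nuked instead: the per-job builds directories are symlinks into the old /var/lib/jenkins/builds tree, so the dangling ones can be found and repointed mechanically. A sketch, assuming the usual jobs/<name>/builds layout; paths are illustrative:)
    # list symlinks whose targets no longer resolve
    find /var/lib/jenkins/jobs -maxdepth 2 -xtype l
    # repoint each job's builds link at the new /srv location
    for b in /var/lib/jenkins/jobs/*/builds; do
        job=$(basename "$(dirname "$b")")
        [ -L "$b" ] && ln -sfn "/srv/jenkins/builds/$job" "$b"
    done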
[17:15:23] (03PS1) 10Mark Bergsma: Reflect new FPC3 ports after cr1-/cr2-eqiad FPC5 decommissioning [dns] - 10https://gerrit.wikimedia.org/r/319617 [17:15:23] the credentials are in /etc/nodepool/nodepool.yaml [17:15:47] or maybe just mark them as deleted [17:15:53] and at one point nodepool will garbage collect them [17:15:56] papaul: sure let's do that [17:16:06] checking the controllers now [17:16:22] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2768932 (10Cmjohnson) Sure, they can be retrofitted I have plenty of LFF/SFF adapters .....but are you sure we want to do that?....they're out of warranty and aqs1003. [17:16:22] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:16:25] mutante: looks mostly fine :) [17:16:35] https://integration.wikimedia.org/zuul/ fails with a 403 :/ [17:16:36] (03Abandoned) 10Dzahn: cache_misc: change doc/integration.wm.o backend [puppet] - 10https://gerrit.wikimedia.org/r/313600 (owner: 10Hashar) [17:16:46] hashar: :) [17:16:49] I am sure it is some apache config oddity [17:16:54] madhuvishy: ok [17:16:54] (03CR) 10Mark Bergsma: [C: 04-1] "I'm going to move xe-3/0/3 (pfw1) for the move, it's now taken for uplinks to row D" [dns] - 10https://gerrit.wikimedia.org/r/319617 (owner: 10Mark Bergsma) [17:17:11] restarting apache2 to be sure [17:17:33] hashar: switch from 2.2 to 2.4 , right [17:17:36] the proxy to zuul web service works at least https://integration.wikimedia.org/zuul/status.json?foobar [17:18:04] hashar: it will be the syntax change "allow/deny from" to "require all granted" etc.. i guess [17:18:09] papaul: [17:18:11] https://www.irccloud.com/pastebin/QWjcxV0c/ [17:18:21] H800 didn't show up [17:18:28] hashar: looking [17:19:09] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: deploy prometheus node_exporter for host monitoring - https://phabricator.wikimedia.org/T140646#2768935 (10fgiunchedi) 05Open>03Resolved This is completed, node_exporter is deployed on all hosts and being polled by prometheus. [17:19:16] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2768937 (10RobH) Oh, these are LFF bay systems? (I didn't think the R720xd could retrofit SFF into LFF bays on the hot swap chassis.) At LFF, its only 12 bays correct? M... [17:19:35] AH01630: client denied by server configuration :D [17:19:57] <IfVersion >= 2.4> [17:19:57] 25 Require all granted [17:19:59] https://doc.wikimedia.org/ 403 as well [17:19:59] needs that [17:20:27] madhuvishy: ok we are going to take the controller out of labstore2002 and put it in labstore2001 and run the command again and see [17:20:35] papaul: okay [17:21:00] papaul: do we need to shut down to do that? [17:21:03] madhuvishy: please poweroff labstore2001 while i am removing the controller from 2002 [17:21:14] okay [17:21:18] madhuvishy: thanks [17:21:26] hashar: do you think it's fine to run: update node set state=4 where target_name='gallium.wikimedia.org'; ? [17:21:38] or should I just delete them from the db?
[17:21:44] for nodepool [17:22:03] (03PS1) 10Jcrespo: labsdb: enable socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/319618 (https://phabricator.wikimedia.org/T140452) [17:22:07] papaul: okay I shut it down [17:22:07] thcipriani: no idea what state=4 is [17:22:16] delete evidently [17:22:17] thcipriani: I would just remove any nodes having target_name=gallium [17:22:27] hashar: ok, doing. [17:22:32] (03PS1) 10Dzahn: integration.wm: update Apache config to 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/319619 [17:23:20] (03PS2) 10Dzahn: integration.wm: update Apache config to 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/319619 (https://phabricator.wikimedia.org/T95757) [17:23:36] (03CR) 10Hashar: "good finding!" [puppet] - 10https://gerrit.wikimedia.org/r/319619 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [17:23:51] mutante: and I thought we had every single apache conf snippets updated :( [17:23:55] (03PS3) 10Dzahn: integration.wm: update Apache config to 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/319619 (https://phabricator.wikimedia.org/T95757) [17:24:13] thcipriani: CI is mostly up from what I can tell [17:24:18] (03CR) 10Dzahn: [C: 032] integration.wm: update Apache config to 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/319619 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [17:24:23] (03CR) 10Dzahn: [V: 032] integration.wm: update Apache config to 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/319619 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [17:24:34] hashar: I think deleting those nodes in the db made nodepool happy, watching debug log now [17:25:04] thcipriani: the trick is "nodepool list" is just a representation of what is in the database [17:25:16] that is not always in sync with what is on the openstack project [17:25:23] yeah, wasn't sure what other havoc that might cause :) [17:25:30] (03PS2) 10Jcrespo: labsdb: enable socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/319618 (https://phabricator.wikimedia.org/T140452) [17:25:36] we can query/interact with openstack directly with: become-nodepool [17:25:41] then use: openstack server list [17:25:42] hashar: we need to have those instances manually deleted [17:25:45] openstack server delete 1234 [17:26:01] we're hitting our instance limit because nodepool isn't aware of how many machines are allocated to contintcloud now [17:26:02] mutante: ran puppet [17:26:16] no more error but blank page? 
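(Spelled out, the database-plus-openstack cleanup settled on above looks roughly like this sketch; the database name and instance id are illustrative, with credentials per /etc/nodepool/nodepool.yaml:)
    # drop the rows tied to the decommissioned jenkins master
    mysql nodepool -e "DELETE FROM node WHERE target_name='gallium.wikimedia.org';"
    # then reconcile openstack itself with what nodepool now believes
    become-nodepool
    openstack server list
    openstack server delete 1234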
[17:26:20] thcipriani: then I guess be bold and delete :) [17:26:26] oh boy [17:26:31] 06Operations, 10hardware-requests: Analytics AQS cluster expansion - https://phabricator.wikimedia.org/T149920#2768949 (10elukey) [17:26:38] mutante: blank pages now https://integration.wikimedia.org/zuul/ :] [17:26:43] 06Operations, 10hardware-requests: Analytics AQS cluster expansion - https://phabricator.wikimedia.org/T149920#2768961 (10elukey) p:05Triage>03Normal [17:27:05] no apache error in log [17:27:08] PHP Warning: require_once(/srv/org/wikimedia/integration/zuul/../../../../shared/IntegrationPage.php): failed to open stream: No such file or directory in /srv/org/wikimedia/integration/zuul/index.php on line 2 [17:27:18] (03CR) 10Jcrespo: [C: 032] labsdb: enable socket authentication [puppet] - 10https://gerrit.wikimedia.org/r/319618 (https://phabricator.wikimedia.org/T140452) (owner: 10Jcrespo) [17:28:56] hashar: i'm looking at doc.wm now, can confirm it's hitting the new backend for sure [17:29:29] hashar: as opposed to integration it does not have the 2.2 config syntax [17:29:36] 06Operations, 06Discovery, 06Maps: Investigate Swift as a storage backend for maps tiles - https://phabricator.wikimedia.org/T149885#2767720 (10MaxSem) For historical perspective, when we were working on initial design, I did investigate Swift. I decided against it because of its reputation for being slow an... [17:29:49] i can also see Windows UAs hitting doc.wm [17:31:17] reality now matches nodepools idea of reality [17:31:22] *nodepool's [17:31:29] awesome [17:33:00] madhuvishy: 2001 is up login back in and run the commmand [17:33:03] madhuvishy: thanks [17:33:11] papaul: doing that now [17:33:15] are we missing /srv/zuul-status now? [17:33:25] back [17:33:36] I mean it is back [17:33:38] hooray [17:34:28] client denied by server configuration: /srv/org/wikimedia/doc/rubygems/mediawiki-ruby-api/ [17:34:31] doc.wikimedia.org isent working though [17:34:36] papaul: same [17:34:40] https://www.irccloud.com/pastebin/ANiLOsAV/ [17:34:53] (03PS2) 10BBlack: varnish-backend-restart: workaround fallocate issues [puppet] - 10https://gerrit.wikimedia.org/r/319596 (https://phabricator.wikimedia.org/T149881) (owner: 10Ema) [17:34:55] paladox: < mutante> hashar: i'm looking at doc.wm now, .. 
[17:34:56] madhuvishy: so the problem is not the controller [17:35:02] Ok [17:35:04] thanks [17:35:05] (03CR) 10BBlack: [C: 032 V: 032] varnish-backend-restart: workaround fallocate issues [puppet] - 10https://gerrit.wikimedia.org/r/319596 (https://phabricator.wikimedia.org/T149881) (owner: 10Ema) [17:35:16] I filled a dummy task [17:35:16] https://phabricator.wikimedia.org/T149924 [17:35:21] the OS is not reading the controller it has to be drivers missing within the OS [17:35:30] papaul: ah hmmm [17:35:40] madhuvishy: what OS do we have on 1001 [17:35:55] (03PS2) 10BBlack: Text VCL: Fix cookie handling for Varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/319561 (https://phabricator.wikimedia.org/T131503) [17:36:00] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: Fix cookie handling for Varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/319561 (https://phabricator.wikimedia.org/T131503) (owner: 10BBlack) [17:36:35] thcipriani: mutante: beside a few left over oddities, I think it is a success overall [17:36:41] at this point I dont think we need to rollback [17:36:48] papaul: Debian GNU/Linux 8.2 (jessie) [17:36:54] ok [17:37:12] madhuvishy: power off 2001 again let me try one more thing [17:37:14] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/319616 (owner: 10Ema) [17:37:54] papaul: done [17:38:36] physically i i can see it and login into the manager section [17:38:44] madhuvishy: give me a minute [17:38:48] papaul: sure [17:39:20] hashar: on gallium we have an Apache config snippet called "50-listen-localhost-9412.conf" that we do not have on contint1001. known? [17:39:28] PROBLEM - HHVM rendering on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [17:39:37] (03PS1) 10Faidon Liambotis: admin: allow DC-Ops to access RAID controller tools [puppet] - 10https://gerrit.wikimedia.org/r/319621 [17:39:47] mutante: yeah that is not needed anymore [17:40:02] ok [17:40:13] mutante: that port 9412 is used for running tests against a live mediawiki. Which we no more do on gallium/contint1001 [17:40:16] do we need an require all granted thing for the docs site? [17:40:28] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 72163 bytes in 0.095 second response time [17:40:38] thcipriani: i thought that's it , like for integration.wm, but there is none in the config [17:40:58] is jenkins not working at all or is it just the web page that isnt workin [17:41:15] (03CR) 10Mark Bergsma: [C: 032] admin: allow DC-Ops to access RAID controller tools [puppet] - 10https://gerrit.wikimedia.org/r/319621 (owner: 10Faidon Liambotis) [17:41:27] (03PS2) 10Faidon Liambotis: admin: allow DC-Ops to access RAID controller tools [puppet] - 10https://gerrit.wikimedia.org/r/319621 [17:41:28] Zppix: which page? [17:41:43] CI [17:41:54] mutante: maybe we should add it, could be an apache version thing? [17:41:59] Zppix: please be specific [17:42:02] Zppix: there is a CI maintenance on going [17:42:11] wfm https://integration.wikimedia.org/ci/ [17:42:16] ah sorry i meant integration [17:42:23] just found we also have a /srv/org/wikimedia/doc/.htaccess file [17:42:31] oh! 
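(Background on those 403s: Apache 2.4 denies everything by default, so directory blocks that relied on 2.2's access directives now need an explicit grant. A sketch of the check-and-reload cycle, with the syntax change shown as comments:)
    # 2.2 style (no longer effective):  Order allow,deny / Allow from all
    # 2.4 style required:               Require all granted
    apachectl configtest                # validate the edited vhost
    sudo systemctl reload apache2
    tail /var/log/apache2/error.log     # AH01630 = "client denied by server configuration"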
[17:42:39] eww [17:42:42] greg-g: im aware im trying to figure out whats wrong so i can determine if its possible i could help debug or whatever [17:43:21] i can copy that over, but let's move it to regular config [17:44:21] +1 [17:44:31] copied [17:44:44] maybe the default in apache is now to disable all of / [17:44:54] yes, i'm trying that now [17:45:30] also, we have integration-mediawiki.org.conf [17:45:38] and only changed integration-wikimedia.org.conf maybe [17:45:41] or at some point on gallium we had a rule allowing /srv [17:46:28] Zppix: I think they have it under control [17:46:36] ack [17:46:37] fixed [17:46:42] https://doc.wikimedia.org/ [17:46:46] patch coming [17:47:13] nice [17:47:14] oh man [17:47:21] ? [17:47:33] ? [17:49:03] reminder for .htaccess https://phabricator.wikimedia.org/T149928 [17:49:16] looks like now we just have public-on-gallium needs to be redefined and for some reason the puppet-compiler node didn't attach, does that seem right? [17:49:26] *publish-on-gallium [17:49:27] (03PS3) 10Arseny1992: Enable OATHAuth on all private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319035 (https://phabricator.wikimedia.org/T149614) [17:49:48] so publish on gallium yeah [17:49:55] if doc breaks again it's puppet run,, will be fixed in a minute [17:50:01] * hashar updates the google doc [17:50:14] https://gerrit.wikimedia.org/r/293300 [17:50:14] Update integration/config [17:50:14] Rsync config/build [17:50:23] I have merged it to update the fabfile [17:51:28] hrm, for https://integration.wikimedia.org/ci/computer/compiler02.puppet3-diffs.eqiad.wmflabs/ seems like there's maybe a ferm rule there. [17:51:32] that needs an update [17:52:18] or a security rule in the labs project [17:52:41] (03PS1) 10BBlack: cache jemalloc: rough tuning [puppet] - 10https://gerrit.wikimedia.org/r/319625 [17:52:43] (03PS1) 10BBlack: cache_text: use file storage [puppet] - 10https://gerrit.wikimedia.org/r/319626 (https://phabricator.wikimedia.org/T131503) [17:52:44] ah, right [17:52:55] (03PS1) 10Dzahn: contint: fix Apache config of doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/319627 (https://phabricator.wikimedia.org/T95757) [17:53:27] I am regenerating the -publish jobs [17:53:33] I don't have access to the project :( [17:53:35] and the publish-on-contint1001 job [17:53:54] (03PS2) 10Dzahn: contint: fix Apache config of doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/319627 (https://phabricator.wikimedia.org/T95757) [17:54:25] doesn't look like there are any ferm rules in the puppet_compiler module. [17:54:34] must be on the tenant [17:54:35] mutante: do you have access to the puppet3-diffs project? [17:54:42] (03CR) 10Dzahn: [C: 032] "yep, the default in 2.4 is to deny / now. so this is needed but wasn't in 2.2. 
adding the standard snippet anyways" [puppet] - 10https://gerrit.wikimedia.org/r/319627 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [17:55:11] (03CR) 10Dzahn: [V: 032] "already applied on contint1001" [puppet] - 10https://gerrit.wikimedia.org/r/319627 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [17:55:18] (03PS3) 10Dzahn: contint: fix Apache config of doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/319627 (https://phabricator.wikimedia.org/T95757) [17:55:21] "Security Group creation disabled temporarily, see https://phabricator.wikimedia.org/T142165 for details" [17:55:22] bah [17:55:32] (03CR) 10Dzahn: [V: 032] contint: fix Apache config of doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/319627 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [17:55:58] thcipriani: I have access to the instance apparently [17:56:06] oh [17:56:16] no iptables rules there [17:56:28] thcipriani: yes [17:56:53] trying to figure out why contint1001 can't connect to some hosts in that project [17:56:56] https://integration.wikimedia.org/ci/computer/compiler02.puppet3-diffs.eqiad.wmflabs/ [17:57:18] well [17:57:22] afaict, only weirdness left? [17:57:28] for now yes [17:57:53] i dont see any instances in that project in wikitech [17:58:16] the instance is in specific project [17:58:16] i guess it's the security group thing you already said [17:58:28] if we cant add any now.. i dunno [17:58:40] would have to ask labs ops I guess [17:58:45] yea [17:58:46] or maybe they can be edited in horizon [17:58:59] I am booting my 2factor device to log in there [18:00:04] gehel: Dear anthropoid, the time has come. Please deploy Wikidata query service (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T1800). [18:00:04] SMalyshev and Jonas_WMDE: A patch you scheduled for Wikidata query service is about to be deployed. Please be available during the process. [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T1800). Please do the needful. [18:00:05] ottomata, MaxSem, and yurik: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:22] you can deploy as needed [18:00:32] CI is mostly done and should overall work just fine [18:00:37] jenkins-bot liked my change :) [18:00:47] (03CR) 10Ema: [C: 031] cache jemalloc: rough tuning [puppet] - 10https://gerrit.wikimedia.org/r/319625 (owner: 10BBlack) [18:00:48] ok. [18:00:54] mutante: that sounds like a Facebook friend +1 on your wall :D [18:00:56] am here! [18:00:57] that was very efficient [18:01:03] good timing [18:01:05] I can run SWAT. [18:01:06] compiler02 instance is in the labs project puppet3-diffs [18:01:18] just out of curosity, does jenkins queue up stuff it missed? [18:01:32] mutante: that is a firewall rule in openstack :] [18:01:39] I think we'll have to manually requeue everything? [18:01:55] hashar: in puppet code or in horizon web ui? [18:01:55] that sucks [18:02:09] solved! [18:02:15] hah [18:02:38] !log Added security rule for "puppet3-diffs" labs project to allow ssh connection from contint1001 instead of gallium [18:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:02] well, +2 for SWAT seems to work for eventbus. 
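(The fix !logged at 18:02:38 is an OpenStack security-group rule on the labs project rather than ferm on either host. With the CLI of that era it would look something like this sketch; the group name, port and source are illustrative, with contint1001's address taken from earlier in the log:)
    # as an admin of the puppet3-diffs project, allow ssh from the new CI master
    openstack security group rule create default \
        --proto tcp --dst-port 22 --src-ip 208.80.154.17/32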
[18:03:08] thcipriani, /me is here [18:03:15] Nothing to deploy on WDQS, this is just a left over from last week. I'm cleaning up the schedule [18:03:18] Zppix: what sucks even more is running a precise instance, which we dont do anymore now. sometimes there is a price to pay [18:03:19] MaxSem: hi, just getting started. [18:03:38] mutante -l release=trusty? [18:03:43] percise rather [18:04:03] Zppix: yes, gallium was precise [18:04:19] the goal is to kill all of them [18:04:21] (03PS2) 10BBlack: cache jemalloc: rough tuning [puppet] - 10https://gerrit.wikimedia.org/r/319625 [18:04:27] (03CR) 10BBlack: [C: 032 V: 032] cache jemalloc: rough tuning [puppet] - 10https://gerrit.wikimedia.org/r/319625 (owner: 10BBlack) [18:04:34] thcipriani: mutante: I am claiming that CI is migrated to contint1001. [18:04:40] hashar: yay !:) [18:04:49] thcipriani: mutante: thank you both for all the assistance and I am quite happy we made it on time [18:04:49] hashar: \o/ kudos [18:04:52] "JIT" [18:04:59] yw [18:05:02] still have to remove the notification on /zuul :P [18:05:07] hashar, do we have an independent verification for these claims? :P [18:05:11] I am going to have dinner with the noisy underaged that around me [18:05:12] (03PS2) 10BBlack: cache_text: switch to file storage backend on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/319609 (https://phabricator.wikimedia.org/T142810) (owner: 10Ema) [18:05:22] MaxSem: you!!!! :] [18:05:25] MaxSem: yes, jenkins-bot is voting [18:05:26] hashar: thcipriani any known issues? [18:05:40] can I reply to that maintenance announce email with "all clear!" [18:05:41] ? [18:05:47] not that I am aware of. hashar ? [18:05:48] greg-g: yeah a few oddities such as apache conf, a firewall rule missing and we lost all the build history (not a big deal) [18:05:54] all fixed [18:05:58] sweet [18:05:59] no build history? That about it? [18:06:02] yeah [18:06:15] we can survive that just fine [18:06:23] (03Abandoned) 10BBlack: cache_text: use file storage [puppet] - 10https://gerrit.wikimedia.org/r/319626 (https://phabricator.wikimedia.org/T131503) (owner: 10BBlack) [18:06:30] alright, I can reply all clear, hashar go eat dinner, thcipriani you seem to be busy with other things already :) [18:07:22] ottomata: your eventbus change for wmf.1 is live on mw1099, check please (if you have anything to check there) [18:08:12] thcipriani: https://gerrit.wikimedia.org/r/#/c/319629/ is a CR+2 from removing the red message on the status page [18:08:28] PROBLEM - HHVM rendering on mw1234 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [18:08:57] (03CR) 10Filippo Giunchedi: "Indeed, there was a question about running Prometheus on bastions e.g. in ulsfo, I've outlined some considerations here: https://wikitech." [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [18:09:21] 06Operations, 10Traffic, 13Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2769186 (10BBlack) Had a chance to dig through our NavTiming performance metrics. There's some slight hints of improvement here and there, but probably just wishfu... 
[18:09:28] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 72164 bytes in 0.119 second response time [18:09:33] (03CR) 10Dzahn: [C: 031] contint: remove gallium conditional from contint::master_dir [puppet] - 10https://gerrit.wikimedia.org/r/318217 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [18:09:35] mutante: all clear thank you very much :] [18:09:48] thcipriani: looking [18:09:50] (03CR) 10Dzahn: [C: 031] contint: remove gallium from firewall::labs [puppet] - 10https://gerrit.wikimedia.org/r/318245 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [18:10:03] hashar: congratulations on a migration well planned :) [18:10:18] (of course, now that I've said that it's certainly jinxed) [18:10:18] hashar: you're welcome. enjoy dinner, i'll wait a moment with the gallium removas [18:10:35] thcipriani: you're fired. [18:10:39] :D [18:10:48] *g* [18:10:53] greg-g: mid swat deploy. That's cold. [18:11:08] mutante: I would keep gallium around a bit more [18:11:14] thcipriani: :) :) [18:11:16] (03PS3) 10Ema: cache_text: switch to file storage backend on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/319609 (https://phabricator.wikimedia.org/T142810) [18:11:29] hashar: mutante yeah, I would, too. (re gallium) "just in case" [18:11:30] mutante: there might be some unpuppetized stuff that is needed (unlikely) and probably homedir material worth moving [18:12:08] hashar: yes, definitely not wiping the disks yet, but stuff like that up there ^ and then removing from puppet [18:12:12] (03CR) 10Gehel: "Havign some throttling is obviously better than none. Having a somewhat loose limitation, is not a technical problem, but might surprise u" [puppet] - 10https://gerrit.wikimedia.org/r/319010 (https://phabricator.wikimedia.org/T108488) (owner: 10Smalyshev) [18:12:53] remove from puppet, wait, shutdown, wait, wipe [18:14:01] ottomata: I also pulled the change for wmf.23 over to mw1099 just now as well. [18:14:28] hashar this https://integration.wikimedia.org/ci/job/operations-puppet-doc/27637/console seems stuck? [18:14:41] oh wait, i mean failing [18:14:51] i thought doc was fixed for puppet? [18:15:01] (03CR) 10BBlack: [C: 032] cache_text: switch to file storage backend on Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/319609 (https://phabricator.wikimedia.org/T142810) (owner: 10Ema) [18:15:22] meh, we should fix "Warning: Unrecognised escape sequence '\;" [18:15:38] ok thcipriani, not totally sure if i did it right [18:15:39] but [18:15:39] i set [18:15:44] X-Wikimedia-Debug with [18:15:50] backend=mw1099.eqiad.wmnet [18:15:59] then i edited my talk page on en.wikpedia.org [18:16:03] sorry [18:16:04] my user page [18:16:10] and, i saw the event come through [18:16:14] so, that's good enough for me! [18:16:19] (if i actually tested it) [18:16:47] paladox: looks to me like it's still running and the Warnings are "normal" [18:16:50] nice. yeah, that's how to test it. There is a plugin for that for chrome, too :) [18:16:55] ok cool [18:16:56] Oh [18:17:01] paladox: nevertheless we should fix them if we can [18:17:04] yeah, i have a little extension that lets me set request headers, just used that [18:17:06] Yep [18:17:42] ottomata: yarp sounds good. OK, I'm going live with wmf.1 first, then I'll deploy wmf.23. [18:18:13] ah cool, yeah i see in post: . 
server:mw1099.eqiad.wmnet [18:18:14] so ya [18:18:16] +1 go ahead [18:19:52] !log thcipriani@tin Synchronized php-1.29.0-wmf.1/extensions/EventBus/EventBus.php: SWAT: [[gerrit:319587|Log more EventBus HTTP request/response context for HTTP errors (T148251)]] (duration: 00m 49s) [18:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:58] T148251: Empty body in EventBus request - https://phabricator.wikimedia.org/T148251 [18:20:16] madhuvishy: try running the command again [18:20:20] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2769260 (10GWicke) > Most of what is needed by MediaWiki (and potentially, any other service ne... [18:20:25] papaul: doing now [18:21:00] papaul: <3 [18:21:04] https://www.irccloud.com/pastebin/PuywTDzf/ [18:21:16] !log thcipriani@tin Synchronized php-1.28.0-wmf.23/extensions/EventBus/EventBus.php: SWAT: [[gerrit:319586|Log more EventBus HTTP request/response context for HTTP errors (T148251)]] (duration: 00m 52s) [18:21:18] (03PS1) 10Dzahn: wmde: capitalize resource reference [puppet] - 10https://gerrit.wikimedia.org/r/319635 [18:21:21] ^ ottomata should be live everywhere [18:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319605 (https://phabricator.wikimedia.org/T149548) (owner: 10Yurik) [18:23:30] madhuvishy: ok we found the problem [18:23:39] papaul: what was up? [18:23:47] I presume the "EU issues" is still valid? [18:24:17] madhuvishy: the problem is that in the command we are running, you are using | grep PERC [18:24:20] hmm, thcipriani ok. i'm still getting requests posted to eventbus that I wouldn't expect if it was live everywhere [18:24:28] madhuvishy: or the new controller is not a PERC [18:24:41] ottomata: do you know which wikis they're coming from? [18:24:50] mostly commonswiki [18:25:14] (03Merged) 10jenkins-bot: LABS: Enable Map (GeoJSON) data on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319605 (https://phabricator.wikimedia.org/T149548) (owner: 10Yurik) [18:25:25] madhuvishy: it is an LSI 3ware SATA+SAS RAID controller card [18:26:04] I'll spot-check a few servers [18:26:10] most from mw1302 [18:26:14] madhuvishy: i put in the old controller card, the H800, which is PERC; that is the reason you can see it now [18:26:18] lots [18:26:18] but that is the top server [18:26:34] (03CR) 10jenkins-bot: [V: 04-1] wmde: capitalize resource reference [puppet] - 10https://gerrit.wikimedia.org/r/319635 (owner: 10Dzahn) [18:26:46] !log repooling cp2016 (T131503) [18:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:51] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [18:28:26] thcipriani: i take it back [18:28:26] it's live [18:28:30] but it is not behaving as I expected [18:28:34] ottomata: hrm. The change is live on mw1302. ...ah, ok :) [18:28:50] the stuff i'm getting in logstash looks like my change [18:30:28] MaxSem: Kartographer change is live on mw1099, check please (if possible) [18:30:42] madhuvishy: does it make sense? [18:31:24] papaul: aah interesting. I also tried listing all the cards without the grep before, and there was only one adapter in the list [18:31:37] madhuvishy: which is?
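As an aside for readers following the X-Wikimedia-Debug exchange above (18:15 to 18:18): the test ottomata describes is just an ordinary request with one extra header, which the infrastructure uses to route the request to a single staging app server. A minimal sketch in Node follows; the header name and backend value are taken straight from the conversation, while the URL path and everything else is illustrative, not a definitive tool.

```javascript
// Minimal sketch of the X-Wikimedia-Debug routing test described above.
// The header name and backend value come from the conversation; the path
// is illustrative. Uses only Node's standard library.
const https = require('https');

https.get({
  hostname: 'en.wikipedia.org',
  path: '/wiki/Special:Version',
  headers: {
    // Ask the edge/app layer to route this request to the staging backend
    // where the patch was synced, instead of a regular production server.
    'X-Wikimedia-Debug': 'backend=mw1099.eqiad.wmnet',
  },
}, (res) => {
  // The response now comes from mw1099, so a change synced only to that
  // host can be verified before going live everywhere.
  console.log('status:', res.statusCode);
  res.resume(); // drain the body; we only care that the routed request works
}).on('error', console.error);
```

A browser extension that sets request headers, as mentioned in the chat, achieves the same thing interactively.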
[18:31:42] H700 [18:31:55] (03PS1) 10Arseny1992: Fix comment refs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319644 [18:32:12] thcipriani, verified that the patch doesn't cause problems - can't really verify that it fixes what it was intended to [18:32:31] Can someone merge ^ [18:32:32] MaxSem: Sure, thanks for the sanity check. Going live everywhere. [18:32:38] (03PS1) 10Mobrovac: RESTBase: Resurrect WikiData domains [puppet] - 10https://gerrit.wikimedia.org/r/319645 (https://phabricator.wikimedia.org/T149114) [18:33:09] madhuvishy: when you do | grep PERC you have both controllers and with no | grep option you have only H700 [18:33:20] papaul: no now I have both [18:33:25] ok [18:33:47] !log thcipriani@tin Synchronized php-1.29.0-wmf.1/extensions/Kartographer/includes/ApiQueryMapData.php: SWAT: [[gerrit:319622|Fix warning (T149923)]] (duration: 00m 47s) [18:33:52] ^ MaxSem live everywhere [18:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:53] T149923: Kartographer API production warning in ApiQueryMapData.php - https://phabricator.wikimedia.org/T149923 [18:34:05] thcipriani ^ [18:34:13] papaul: so you're saying before, there was a new controller put in, but it was not an H800? [18:34:45] arseny92: can you put your change on the deployment calendar https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0November.C2.A003 ? [18:34:57] madhuvishy: so here is the issue: the H800 is not the new controller, it was the old controller that used to be in 2001 and 2002; then we decided to buy new controllers, which are the 3ware LSI, to replace the H80 [18:35:01] H800 [18:35:05] thanks thcipriani [18:35:29] MaxSem: the -labs changes will go with the next beta-code-update-eqiad/beta-scap-eqiad cycle [18:35:41] yurik, ^ [18:35:42] (I'll sync them for housekeeping now as well) [18:36:13] thcipriani a comment doesn't warrant scheduling https://wikitech.wikimedia.org/wiki/Deployments/Inclusion_criteria [18:36:13] thanks thcipriani ! [18:36:28] papaul: ah - and how many disks does the 3ware LSI controller support? [18:36:53] madhuvishy: don't know, have to check that [18:37:45] arseny92: ah, I see. Could you add it anyway? [18:38:01] madhuvishy: 127 SAS or SATA [18:38:49] anyone feel like reviewing https://gerrit.wikimedia.org/r/#/c/319643/ for me? [18:39:41] papaul: interesting. why did we swap out the H800 in the past with this new one? [18:39:53] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:319605|LABS: Enable Map (GeoJSON) data on Commons (T149548)]] (housekeeping only sync) (duration: 00m 50s) [18:39:57] added [18:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:00] T149548: Implement GeoJSON shared on-wiki storage - https://phabricator.wikimedia.org/T149548 [18:40:04] madhuvishy: performance i think [18:40:04] arseny92: cool, thanks, looking [18:40:24] because it looks like the 3ware is more powerful than the H800 [18:40:29] papaul: oh so where did the older H800 go? [18:41:01] (03CR) 10BBlack: [C: 032] RESTBase: Resurrect WikiData domains [puppet] - 10https://gerrit.wikimedia.org/r/319645 (https://phabricator.wikimedia.org/T149114) (owner: 10Mobrovac) [18:41:12] madhuvishy: when it was replaced with the 3ware i had it here at the DC [18:41:31] papaul: also do we have two of the newer 3ware controllers?
[18:41:47] madhuvishy: so it is just that the 3ware LSI never got configured [18:41:58] madhuvishy: yes we do have 2 [18:41:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319644 (owner: 10Arseny1992) [18:42:07] madhuvishy: for 2001 and 2002 [18:42:40] (03Merged) 10jenkins-bot: Fix comment refs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319644 (owner: 10Arseny1992) [18:43:00] madhuvishy: when i log in to the 3ware manager interface it is missing drivers [18:43:18] hrm, grrrit-wm1 doesn't seem to be logging config change stuff for some reason [18:43:26] madhuvishy: so we need to get the drivers installed first and go from there [18:43:38] if we want to use the 3ware controllers [18:43:42] btw doing this https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=951481&oldid=951479 breaks the calendar for the week so anyone who did it watch out [18:43:50] 06Operations, 10Traffic, 13Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2769467 (10BBlack) The more I dig and study the relevant information that's out there, and especially in light of the invisible impact of the chapoly change on our... [18:44:52] (03PS2) 10BBlack: ssl_ciphersuite: switch AES bits order for GCM [puppet] - 10https://gerrit.wikimedia.org/r/316891 (https://phabricator.wikimedia.org/T144626) [18:45:14] madhuvishy: i have to get some food, it's been a long AM; i will update the task and see what we can do from there [18:45:32] i thought that too earlier, then it told me about it a minute later [18:45:57] like it was delayed and then a couple changes showed up all at once.. [18:46:11] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:319644|Fix comment refs (T148327)]] (duration: 00m 47s) [18:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:18] T148327: Show changes which was made in last 14 days in watchlist in cswiki by default (for new users) - https://phabricator.wikimedia.org/T148327 [18:46:24] ^ arseny92 change is live everywhere [18:46:28] grrrit-wm1: restart [18:46:39] arseny92: thanks for the comment housekeeping! [18:47:25] robh: ping [18:47:32] mutante it's grrrit-wm: restart [18:47:38] re-connecting to gerrit [18:47:38] reconnected to gerrit [18:47:58] papaul: ? [18:47:59] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:48:08] papaul: alright thank you [18:48:08] robh: do you recall the task for the 3ware LSI controller for labstore2001 and 2002 ? [18:48:20] so it looks for grrrit-wm: *restart, doesn't go by when it changes nick [18:48:22] (03PS3) 10Elukey: First Docker prototype [software/varnish/varnishkafka/testing] - 10https://gerrit.wikimedia.org/r/319548 (https://phabricator.wikimedia.org/T147442) [18:48:22] papaul: I'm not even sure what you are referring to, so nope? [18:49:14] thcipriani, it's just that Reedy didn't like that being in PS2 of https://gerrit.wikimedia.org/r/#/c/319035/ [18:49:33] mutante :) [18:49:38] So I split it [18:49:42] paladox: maybe in a future version we can fix that, and replace the hardcoded string with something like "$nickname: restart" [18:49:48] Yep [18:50:01] robh: ok thank you i will check in RT [18:50:21] ah, gotcha [18:51:16] That's to me?
;) [18:51:34] thcipriani: so that "restart" command for the bot is there, but it does not kill the IRC connection (good), only the ssh connection it has over to Gerrit [18:51:52] that doesn't mean i know why it stopped :) [18:51:55] I will work on supporting killing the irc connection in a min [18:52:10] should be pretty easy i think. [18:52:11] paladox: don't, we like that it doesn't have to do that :) [18:52:27] mutante i mean another command, so we can change the nick [18:52:37] paladox: ah, ok! [18:52:43] since having it randomly like grrrit-wm1, doesn't look nice [18:53:05] and it loses its wikimedia/bot identifier [18:53:54] This is blocking the train today: https://gerrit.wikimedia.org/r/#/c/319643/ [18:53:54] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:54:16] I'll self merge if I have to but I wouldn't mind a code review on ^ before I deploy to group2 later today [18:54:34] ./nickserv identify grrrit-wm password [18:54:34] paladox: yes, better since the command includes the hardcoded nick [18:54:38] twentyafterfour: generally, you can get review on #mediawiki-core when urgent [18:55:00] Yep [18:55:15] heh mediawiki-core is still alive? [18:55:16] with that you id to the account regardless of any nick you have [18:55:47] yeh [18:55:59] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:55:59] twentyafterfour: fix it in master, then cherry-pick to the branch; that's better, you don't want to maintain the fix for each wmf version [18:56:19] Dereckson: it's cherry-picked to master [18:56:45] https://gerrit.wikimedia.org/r/#/q/Ieb61585af3aa60b7af58597091151d0b494b2fd6 [18:57:20] Our first test with the restart command for prod with grrrit-wm worked. :) [18:57:42] Tested on the test bot; things can be different for prod, but it works :) [18:58:24] paladox: i saw, yes [18:58:30] :0 [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T1900). Please do the needful. [19:01:10] Could someone look into T149882 ? [19:01:10] T149882: [Bug] TimestampException from line 213 of /srv/mediawiki/php-master/includes/libs/time/ConvertibleTimestamp.php - https://phabricator.wikimedia.org/T149882 [19:02:53] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2769513 (10jcrespo) [19:03:01] arseny92: that's a dupe [19:03:26] https://gerrit.wikimedia.org/r/#/c/318351/ [19:04:49] uh [19:05:09] ok https://phabricator.wikimedia.org/T149257#2769520 [19:06:25] PROBLEM - HHVM rendering on mw1198 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.003 second response time [19:07:25] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 72165 bytes in 0.186 second response time [19:12:32] !log In order to unblock the train for group2: deploying https://gerrit.wikimedia.org/r/#/c/319643/ refs T149059, T149849 [19:13:01] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2769548 (10Papaul) Replaced the bad disk on labstore2002 with one disk from labstore2002. When I received the new disk i will put it into lab... [19:13:18] no stashbot?
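The restart-command thread above (18:47 to 18:55) comes down to one detail: the bot matches the literal string "grrrit-wm: restart", so once it falls back to a nick like grrrit-wm1 the command no longer matches. A tiny sketch of the "$nickname: restart" idea mutante and paladox discuss follows; the function and parameter names are hypothetical, not the bot's real code.

```javascript
// Hypothetical sketch of a nick-agnostic restart trigger: compare against
// the connection's *current* nick instead of a hardcoded "grrrit-wm", so
// the command keeps working after a fallback nick like grrrit-wm1.
function isRestartCommand(currentNick, messageText) {
  // currentNick is whatever the IRC library reports for this connection.
  return messageText.trim() === currentNick + ': restart';
}

// e.g. isRestartCommand('grrrit-wm1', 'grrrit-wm1: restart') === true
```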
[19:15:04] (03PS1) 10Yuvipanda: paws_internal: Install statistics related packages [puppet] - 10https://gerrit.wikimedia.org/r/319650 (https://phabricator.wikimedia.org/T149543) [19:15:12] stashbot, help [19:15:24] don't actually know if it responds to messages like that.. [19:16:05] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:16:16] I don't know but it seems like it might need kicking [19:17:57] !log In order to unblock the train for group2: deploying https://gerrit.wikimedia.org/r/#/c/319643/ refs T149059, T149849 [19:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:05] T149059: MW-1.29.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T149059 [19:18:05] T149849: [Bug] fatal PageImages Call to a member function getUrl() on a non-object (boolean) - https://phabricator.wikimedia.org/T149849 [19:18:10] twentyafterfour, fix'd [19:19:04] Krenair: nice, thanks [19:19:27] I just did `sudo become stashbot` and `./bin/stashbot.sh restart` [19:19:41] (03PS2) 10Dzahn: wmde: capitalize resource reference [puppet] - 10https://gerrit.wikimedia.org/r/319635 [19:20:09] thcipriani hi, i think i know why grrrit-wm crashes when you upload your patch that updates wikis to a new version. [19:20:15] was speaking to twentyafterfour today [19:20:32] about it, and i am testing with a test instance and a test bot [19:21:39] twentyafterfour: going to group2 ? [19:22:33] paladox: awesome! Glad to hear you have a theory. Should be able to test when twentyafterfour rolls forward. [19:22:36] matanya: yes group2 is coming right up [19:23:09] (03CR) 10jenkins-bot: [V: 04-1] wmde: capitalize resource reference [puppet] - 10https://gerrit.wikimedia.org/r/319635 (owner: 10Dzahn) [19:23:48] !log restbase restarting to re-include wikidata domains for T149114 [19:23:52] twentyafterfour: you noticed the small rise in sql connection issues ? [19:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:54] T149114: Reconsider wikidata support in the REST API - https://phabricator.wikimedia.org/T149114 [19:23:55] Yep, thcipriani one problem though: i won't get any logs if grrrit-wm crashes [19:24:08] kubernetes deletes the logs for terminated processes [19:24:21] but the test bot isn't using kubernetes [19:24:24] so i get a log [19:25:56] matanya: I hadn't noticed, why do you ask?
[19:26:54] twentyafterfour: i don't think it is related directly to the version, but it does seem to me there are more: Error: 2013 Lost connection to MySQL server during query (10.64.32.26) [19:27:04] (03PS3) 10Dzahn: wmde: capitalize resource reference [puppet] - 10https://gerrit.wikimedia.org/r/319635 [19:27:05] though i don't have numbers to prove this [19:27:52] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 632 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3052926 keys, up 3 days 11 hours - replication_delay is 632 [19:27:59] that would not usually be related to anything in mediawiki unless we are hammering the sql servers beyond what they can handle [19:28:18] (03PS3) 10Dzahn: admin: update hashar gdbinit script [puppet] - 10https://gerrit.wikimedia.org/r/310794 (owner: 10Hashar) [19:28:50] twentyafterfour: i agree, just pointing out [19:28:56] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.1/extensions/PageImages/includes/ApiQueryPageImages.php: T149849 (duration: 00m 47s) [19:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:02] T149849: [Bug] fatal PageImages Call to a member function getUrl() on a non-object (boolean) - https://phabricator.wikimedia.org/T149849 [19:29:08] (03CR) 10Dzahn: [C: 032] admin: update hashar gdbinit script [puppet] - 10https://gerrit.wikimedia.org/r/310794 (owner: 10Hashar) [19:29:23] although if our queries take longer, we will see this too [19:29:30] so it is somewhat software related [19:29:52] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3041089 keys, up 3 days 11 hours - replication_delay is 0 [19:29:53] yeah [19:30:21] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319654 (https://phabricator.wikimedia.org/T146807) [19:30:36] Ok the train appears to be ready to leave the station [19:30:47] hopefully T148251 doesn't blow up in my face [19:30:48] T148251: Empty body in EventBus request - https://phabricator.wikimedia.org/T148251 [19:31:26] !log change-prop deploying f107669 [19:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:12] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3125: Connection refused [19:32:21] ^ that's me, sorry!
:) [19:32:32] PROBLEM - Varnishkafka log producer on cp2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [19:32:32] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3126: Connection refused [19:32:42] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3127: Connection refused [19:32:42] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3122: Connection refused [19:32:42] PROBLEM - Varnish HTTP text-frontend - port 80 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 80: Connection refused [19:32:42] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3120: Connection refused [19:32:52] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3121: Connection refused [19:33:02] PROBLEM - Varnish HTTP text-backend - port 3128 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3128: Connection refused [19:33:02] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3124: Connection refused [19:33:02] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp2001 is CRITICAL: connect to address 10.192.0.122 and port 3123: Connection refused [19:33:07] bleh [19:33:12] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 seconds ago with 2 failures. Failed resources (up to 3 shown) [19:33:12] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 320 bytes in 0.072 second response time [19:33:16] I guess I'll have to downtime them [19:33:28] procedure is not as fast as it seems at first glance! 
[19:33:29] that scared me for a moment ;) [19:33:37] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [19:33:42] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [19:33:42] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [19:33:42] RECOVERY - Varnish HTTP text-frontend - port 80 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.073 second response time [19:33:42] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [19:33:52] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [19:34:02] RECOVERY - Varnish HTTP text-backend - port 3128 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.074 second response time [19:34:02] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [19:34:02] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 0.072 second response time [19:34:08] it's depooled regardless [19:34:12] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:34:13] just annoying monitoring spam [19:34:32] RECOVERY - Varnishkafka log producer on cp2001 is OK: PROCS OK: 3 processes with command name varnishkafka [19:34:34] twentyafterfour: isn't that already deployed as part of swat anyway? not sure [19:34:37] thought it was [19:35:12] PROBLEM - Apache HTTP on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [19:35:12] PROBLEM - HHVM rendering on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [19:35:19] ottomata: maybe so, that could explain the jump I saw ;) [19:35:42] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1065 is CRITICAL: connect to address 10.64.0.102 and port 3128: Connection refused [19:36:12] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.034 second response time [19:36:12] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 72164 bytes in 0.103 second response time [19:38:42] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1065 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.000 second response time [19:38:55] I've parsed that as 'annoying monitoring scam' [19:39:04] :) [19:39:12] twentyafterfour: not sure why you'd see a jump in numbers from that [19:39:16] all it does is add extra logging info [19:39:22] not extra logs or extra events [19:40:32] ottomata: I understand that, it's just the codepath is getting hit more often [19:40:52] ottomata: I may even be wrong about that, but it did look like an increase yesterday (unrelated to anything you changed) [19:41:31] I think that when the jobrunners got bumped to wmf.1 then that increased the frequency of hits to the codepath that emits that message [19:41:47] twentyafterfour: that is possible, but it also might just be in flux with the number of links that get updated at any given time [19:41:55] but, could be! 
[19:42:03] yeah I'm just watching it, not too concerned [19:42:06] (03PS2) 10Dzahn: contint: remove gallium from firewall::labs [puppet] - 10https://gerrit.wikimedia.org/r/318245 (https://phabricator.wikimedia.org/T95757) [19:42:20] we think that json is not encoding properly, so there might be something that has been added to the objects that the linksupdate hook passes [19:42:30] that causes the php array to not serialize to json well [19:42:31] we think...anyway [19:42:35] really aren't sure [19:42:37] aye cool [19:42:38] thanks [19:44:32] twentyafterfour: spike in metawiki fatal now [19:46:40] matanya: I haven't pushed to group2 yet [19:47:12] meta is group1 twentyafterfour [19:47:38] and it is all central notice, so not a blocker anyway [19:47:50] I also haven't touched anything on group1 ;) [19:51:01] ottomata: this is what concerns me: https://goo.gl/qLizso [19:51:14] > 99% of the messages come from wmf.1 [19:55:53] (03PS1) 10Yuvipanda: statistics: Move packages to own class [puppet] - 10https://gerrit.wikimedia.org/r/319663 [19:57:45] (03PS2) 10Yuvipanda: statistics: Move packages to own class [puppet] - 10https://gerrit.wikimedia.org/r/319663 [19:58:02] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2769662 (10madhuvishy) @Papaul Thanks for the summary. As far as I'm aware all the labstore boxes, and majority of the rest prod storage se... [19:59:11] (03PS3) 10Yuvipanda: statistics: Move packages to own class [puppet] - 10https://gerrit.wikimedia.org/r/319663 [20:00:06] killed grrrit-wm again :-/ [20:00:11] paladox: ^ [20:00:40] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319664 (owner: 1020after4) [20:00:42] (03CR) 10Yuvipanda: [C: 032] statistics: Move packages to own class [puppet] - 10https://gerrit.wikimedia.org/r/319663 (owner: 10Yuvipanda) [20:01:30] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.1 [20:01:30] here goes wmf.1 [20:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:03] seeing a lot of db errors indeed. "Server db1071 (#4) is not replicating?" and "Server db1051 (#1) has >= 6.6266131401062 seconds of lag" [20:08:16] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:09:02] !log codfw cache_text - all pooled nodes are v4 (2x still depooled-but-upgraded) - T131503 [20:09:39] 06Operations, 06Labs, 07Tracking: Add config option in tools webservice debian package to write logs to /dev/null - https://phabricator.wikimedia.org/T149946#2769686 (10madhuvishy) [20:09:50] twentyafterfour: it might be version related after all ? 
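The serialization suspicion discussed above (19:42) concerns PHP's json_encode inside the EventBus extension, and the eventual fix (gerrit:319661, deployed later in this log) adds logging plus a check for an empty JSON-encoded body. Purely as an illustration of that shape of guard, here is a sketch in JavaScript; the function and logger names are assumptions, not the extension's real code.

```javascript
// Illustrative sketch (not the real PHP) of the empty-body check discussed
// above: serialize the events, and if the result is missing or empty, log
// context instead of sending a request with an empty body (cf. T148251).
function encodeEventsOrLog(events, logger) {
  let body;
  try {
    body = JSON.stringify(events);
  } catch (err) {
    // Serialization can fail outright, e.g. on circular references.
    logger.error('Could not serialize events', { err: String(err) });
    return null;
  }
  if (!body || body === '[]') {
    // This is the symptom from T148251: a request with no usable payload.
    logger.error('Serialized event body is empty', { count: events.length });
    return null;
  }
  return body; // safe to POST to the eventbus service
}
```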
[20:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:23] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [20:10:58] matanya: I'm not sure [20:11:11] I didn't see a spike after pushing to group2 [20:11:16] it's just a lot of errors [20:11:36] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.352 second response time [20:12:36] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.736 second response time [20:14:26] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.715 second response time [20:14:44] twentyafterfour ok [20:14:45] thanks [20:14:46] sorry for late response - dinner [20:14:47] but i'm back [20:15:26] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.915 second response time [20:16:14] (03CR) 10Dzahn: [C: 032] "This was the special case for the ssd based on gallium's hostname. Even if we still need it (hopefully not), this will not umount it or a" [puppet] - 10https://gerrit.wikimedia.org/r/318217 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [20:17:25] (03PS2) 10Dzahn: contint: remove gallium conditional from contint::master_dir [puppet] - 10https://gerrit.wikimedia.org/r/318217 (https://phabricator.wikimedia.org/T95757) [20:18:46] (03CR) 10Dzahn: "additionally, puppet is disabled on gallium now anyways" [puppet] - 10https://gerrit.wikimedia.org/r/318217 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [20:19:00] (03PS3) 10Dzahn: contint: remove gallium conditional from contint::master_dir [puppet] - 10https://gerrit.wikimedia.org/r/318217 (https://phabricator.wikimedia.org/T95757) [20:20:30] (03PS5) 10Gehel: maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) [20:23:28] 07Puppet, 06Labs, 10wikitech.wikimedia.org: Puppet failure e-mail from labs/wikitech contains wrong url to wikitech - https://phabricator.wikimedia.org/T149883#2769728 (10Krenair) 05Open>03Invalid Looks fine to me in puppet, https://gerrit.wikimedia.org/r/#/c/271978/ would've fixed this if that instance... [20:25:56] (03PS6) 10Gehel: maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) [20:26:27] !log codfw cache_text - all nodes v4 and pooled - T131503 [20:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:33] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [20:28:19] (03CR) 10Gehel: [V: 032] maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [20:30:35] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2769735 (10Papaul) It is your call; i don't know anything about the lab setup, so anything you want for the box I will follow.
[20:32:30] greg-g: may want to mention labs-only & no-ops deploys, they always cause drama with the "no deploys" wording [20:32:34] thcipriani twentyafterfour it is this command git push origin "HEAD:refs/for/master/${VERSION}%l=Code-Review+2" that is crashing the bot [20:32:36] just tested [20:32:42] i can get a log now [20:32:57] info: Sent message from labs/tools to #wikimedia-bot-gerrit [20:32:57] var inlineCount = message.comment.match(/(?:^|\s)\((\d+) comments?\)(?:$|\s)/), [20:32:57] ^ [20:32:57] TypeError: Cannot read property 'match' of undefined [20:33:43] !log T133395: Enabling unchecked_tombstone_compaction and setting tombstone_threshold = .6 on "local_group_wikipedia_T_parsoid_html".data [20:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:52] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [20:34:12] no longer a theory [20:34:49] 07Puppet, 06Labs, 10wikitech.wikimedia.org: Puppet failure e-mail from labs/wikitech contains wrong url to wikitech - https://phabricator.wikimedia.org/T149883#2769745 (10Krenair) Okay, I just literally logged in, found a `/var/lib/puppet/state/agent_catalog_run.lock` file from March, deleted it, ran puppet,... [20:34:50] paladox: makes sense, I suppose, just need to make sure message.comment is defined :) [20:35:02] it is though [20:35:41] twentyafterfour: still following DB errors ? [20:35:43] full log https://phabricator.wikimedia.org/P4365 [20:36:18] maybe a try and catch would work [20:36:33] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Puppet has 12 failures. Last run 2 minutes ago with 12 failures. Failed resources (up to 3 shown): Package[myspell-tn],Package[myspell-ve],Package[myspell-xh],Package[myspell-nr] [20:36:47] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2769747 (10madhuvishy) @Papaul I wanted to make sure our current thought process is documented. If the H800's work with our storage shelves a... [20:37:14] let's try [20:37:54] Oh [20:38:04] it's in exports['comment-added'] = function(message) { [20:51:52] (03CR) 10Hashar: [C: 031] "Seen that error in the doc generation as well :)" [puppet] - 10https://gerrit.wikimedia.org/r/319635 (owner: 10Dzahn) [20:53:55] matanya: I didn't see any spikes in db errors [20:54:01] ok [20:58:49] heading out, back in 2 hours for my swat thing [21:00:23] (03PS1) 10Hashar: Remove zuul.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/319675 [21:01:01] (03CR) 10Hashar: "That is no more needed since the migration to contint1001 :]" [dns] - 10https://gerrit.wikimedia.org/r/319675 (owner: 10Hashar) [21:01:24] PROBLEM - HHVM rendering on mw1203 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [21:01:45] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2769866 (10Gilles) I think I'll go read the systemd code and/or write a test app creating junk files. The systemd docs mentioned the feature is exactly the same as setrli...
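To make the crash above concrete: the pasted trace shows the bot's comment-added handler calling message.comment.match(...), and a push to refs/for/...%l=Code-Review+2 produces a comment-added stream event with no comment text, so .match() runs on undefined. A guard along the lines paladox and thcipriani discuss ("just need to make sure message.comment is defined") might look like the sketch below; only the quoted handler line and the regex are from the log, the rest is illustrative.

```javascript
// Sketch of the guard discussed above. The 'comment-added' handler shape
// and the regex come from the pasted trace; the surrounding bot code is
// not reproduced, and `message` is assumed to be Gerrit's stream event.
exports['comment-added'] = function (message) {
  // A push like HEAD:refs/for/master%l=Code-Review+2 emits a comment-added
  // event without any comment text, so default to an empty string before
  // calling .match() on it.
  var comment = message.comment || '';
  var inlineCount = comment.match(/(?:^|\s)\((\d+) comments?\)(?:$|\s)/);
  // ... rest of the handler unchanged ...
};
```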
[21:02:24] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 72194 bytes in 0.155 second response time [21:12:53] 06Operations, 10Continuous-Integration-Infrastructure, 07Nodepool, 13Patch-For-Review: Clean up apt:pin of python modules used for Nodepool - https://phabricator.wikimedia.org/T137217#2361186 (10hashar) [21:15:43] (03PS1) 10BryanDavis: wikitech: remove 'bots' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319683 [21:17:37] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2769974 (10ori) >>! In T145878#2769866, @Gilles wrote: > The systemd docs mentioned the feature is exactly the same as setrlimit. It feels safer to me to set that at the... [21:19:28] 06Operations, 10Continuous-Integration-Infrastructure, 07Nodepool, 13Patch-For-Review: Clean up apt:pin of python modules used for Nodepool - https://phabricator.wikimedia.org/T137217#2769982 (10hashar) I have poked the ops-l internal mailing list to get this scheduled. [21:21:23] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2769993 (10Papaul) @madhuvishy @chasemp please see the link below for why H800 needed replacement. https://rt.wikimedia.org/Ticket/Display.htm... [21:32:49] (03CR) 10Filippo Giunchedi: "One side effect of mtail on the central syslog servers is that this metric is also indirectly available in graphite as 'mtail.lithium.kern" [puppet] - 10https://gerrit.wikimedia.org/r/315272 (https://phabricator.wikimedia.org/T148962) (owner: 10Gilles) [21:34:21] (03PS16) 10Zppix: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [21:34:38] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:39:27] 06Operations, 07Puppet, 10Continuous-Integration-Config, 07Upstream: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2770145 (10hashar) The [[ https://integration.wikimedia.org/ci/job/operations-puppet-doc/ | operations-puppet-doc ]] no more tr... [21:42:48] (03PS1) 10Yuvipanda: statistics: Bring in R debs from upstream [puppet] - 10https://gerrit.wikimedia.org/r/319700 (https://phabricator.wikimedia.org/T149949) [21:43:06] (03CR) 10jenkins-bot: [V: 04-1] statistics: Bring in R debs from upstream [puppet] - 10https://gerrit.wikimedia.org/r/319700 (https://phabricator.wikimedia.org/T149949) (owner: 10Yuvipanda) [21:43:18] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:43:27] (03PS2) 10Yuvipanda: statistics: Bring in R debs from upstream [puppet] - 10https://gerrit.wikimedia.org/r/319700 (https://phabricator.wikimedia.org/T149949) [21:44:45] (03CR) 10jenkins-bot: [V: 04-1] statistics: Bring in R debs from upstream [puppet] - 10https://gerrit.wikimedia.org/r/319700 (https://phabricator.wikimedia.org/T149949) (owner: 10Yuvipanda) [21:49:07] (03PS3) 10Yuvipanda: statistics: Bring in R debs from upstream [puppet] - 10https://gerrit.wikimedia.org/r/319700 (https://phabricator.wikimedia.org/T149949) [21:54:48] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [21:59:38] PROBLEM - HHVM rendering on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [22:00:15] thcipriani i think i have fixed it now, and will roll it out to the grrrit-wm bot [22:00:28] When you do git push origin "HEAD:refs/for/master/${VERSION}%l=Code-Review+2" [22:00:38] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 72193 bytes in 0.131 second response time [22:00:44] it will immediately do +2, which won't show on the bot but it will show you uploaded the patch [22:00:58] cool! [22:01:41] i believe this may be a bug in gerrit since gerrit should not be using the comment-added stream for +2 [22:02:37] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:05:24] https://gerrit.wikimedia.org/r/#/c/319705/ [22:12:17] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:16:31] (03PS1) 10Hashar: Do not backup /srv/jenkins/builds on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/319710 [22:20:01] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2770332 (10Deskana) @ebernhardson does not see as many alerts any more. @gehel Do you consider this resolved? If not, can you give details on some... [22:21:09] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Collect metrics on pool counter usage - https://phabricator.wikimedia.org/T130617#2770347 (10Deskana) p:05Normal>03Low @EBernhardson thinks this is nice to have but not particularly important; decreasing in priority accordingly. [22:21:55] (03CR) 10Hashar: [C: 04-1] "Not ready!" [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) (owner: 10Hashar) [22:21:59] (03CR) 10Hashar: "Not ready!" [puppet] - 10https://gerrit.wikimedia.org/r/311959 (owner: 10Hashar) [22:22:12] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Have dedicated master nodes for elasticsearch - https://phabricator.wikimedia.org/T130590#2140227 (10Deskana) @gehel Not much discussion has happened here. Is this something that still needs discussion, is now actionable, or is this stale and... [22:22:47] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:23:12] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2770377 (10Gehel) The first part of checking cluster state against the service address (search.svc...) is done, but the check of aliases is not the... [22:23:37] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, and 2 others: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2247220 (10Deskana) [22:26:35] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Have dedicated master nodes for elasticsearch - https://phabricator.wikimedia.org/T130590#2770391 (10Gehel) This is definitely not implemented yet. And yes, it would most probably make sense to do it. [22:27:57] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Have dedicated master nodes for elasticsearch - https://phabricator.wikimedia.org/T130590#2770393 (10Deskana) >>!
In T130590#2770391, @Gehel wrote: > This is definitely not implemented yet. And yes, it would most probably make sense to do it.... [22:29:02] (03CR) 10Dzahn: [C: 032] Do not backup /srv/jenkins/builds on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/319710 (owner: 10Hashar) [22:31:31] thcipriani twentyafterfour i have deployed the fix temporarily and will change it depending on the reviews the change gets [22:31:46] this should hopefully fix it now (stop crashing). [22:40:15] (03PS3) 10Dzahn: zuul-gearman-status.py requires python-gear [puppet] - 10https://gerrit.wikimedia.org/r/319612 (owner: 10Hashar) [22:47:08] (03PS1) 10Andrew Bogott: Designate nova_fixed_multi plugin: avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/319759 (https://phabricator.wikimedia.org/T115194) [22:47:18] (03CR) 10Dzahn: [C: 032] zuul-gearman-status.py requires python-gear [puppet] - 10https://gerrit.wikimedia.org/r/319612 (owner: 10Hashar) [22:49:00] (03PS2) 10Andrew Bogott: Designate nova_fixed_multi plugin: avoid race conditions [puppet] - 10https://gerrit.wikimedia.org/r/319759 (https://phabricator.wikimedia.org/T115194) [22:49:28] PROBLEM - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [22:49:29] ACKNOWLEDGEMENT - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T149964 [22:49:33] 06Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T149964#2770452 (10ops-monitoring-bot) [22:49:53] nice @nagiosadmin auto-ack :) [22:50:29] i already created [22:50:48] _it_ already created [22:50:56] T149964 [22:50:57] T149964: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T149964 [22:51:27] pretty sweet that it works now [22:54:11] :) [22:56:18] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [22:56:30] PROBLEM - Apache HTTP on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time [22:57:18] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 72184 bytes in 0.102 second response time [22:57:28] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.021 second response time [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161103T2300). Please do the needful. [23:00:05] jan_drewniak and ottomata: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:24] yeah i'm here!
[23:01:54] o/ [23:03:30] (03PS4) 10Dzahn: contint: drop contint::packages [puppet] - 10https://gerrit.wikimedia.org/r/317988 (owner: 10Hashar) [23:03:35] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/4542/" [puppet] - 10https://gerrit.wikimedia.org/r/317988 (owner: 10Hashar) [23:03:54] I can SWAT [23:04:53] (03PS2) 10Thcipriani: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319654 (https://phabricator.wikimedia.org/T146807) (owner: 10Jdrewniak) [23:04:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319654 (https://phabricator.wikimedia.org/T146807) (owner: 10Jdrewniak) [23:05:31] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319654 (https://phabricator.wikimedia.org/T146807) (owner: 10Jdrewniak) [23:07:07] jan_drewniak: your changes are live on mw1099, check please [23:07:14] (03CR) 10Alex Monk: [C: 031] wikitech: remove 'bots' group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319683 (owner: 10BryanDavis) [23:07:58] thcipriani: yup, looks good [23:08:06] jan_drewniak: ok, going live. [23:10:10] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:319654|Bumping portals to master (T146807)]] (duration: 00m 51s) [23:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:15] T146807: Wikipedia/Wikimedia apps availability test: analyze results - https://phabricator.wikimedia.org/T146807 [23:11:03] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:319654|Bumping portals to master (T146807)]] (duration: 00m 52s) [23:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:13] ^ jan_drewniak live everywhere [23:11:24] thcipriani: thanks! [23:12:20] bd808, what was decided re. l10nupdate and the scap touch of IS? [23:12:42] kill l10nupdate? [23:13:07] really? [23:13:11] twentyafterfour: I just pulled down a change for PageImages for wmf.1 that looks like yours? [23:13:18] +100000 to killing that thing [23:13:30] But prolly needs an RfC [23:13:36] have you talked to some i18n people about how they feel about that idea? [23:14:53] Not in about 2 years. [23:14:56] Hence rfc. [23:15:01] heh [23:15:15] As we do branch/deploy regularly... I think it's less important [23:15:17] fwiw, there is a patch for scap so that it doesn't touch IS as long as you use a flag: https://phabricator.wikimedia.org/D435 [23:15:55] Has anyone started a discussion at least? :P [23:16:03] Reedy: Indeed. It came from stupider days when we branched every 3-6 months. [23:16:13] (03PS4) 10Dzahn: wmde: capitalize resource reference [puppet] - 10https://gerrit.wikimedia.org/r/319635 [23:16:14] yeah but the flag somehow got renamed as 'beta-only-change' [23:16:19] which implies it won't be used by l10nupdate [23:16:31] thcipriani: I deployed that change for pageimages .... [23:16:38] (03PS5) 10Dzahn: statistics/wmde: capitalize resource reference [puppet] - 10https://gerrit.wikimedia.org/r/319635 [23:16:49] I even tested before and after [23:16:50] (03CR) 10Dzahn: [C: 032] statistics/wmde: capitalize resource reference [puppet] - 10https://gerrit.wikimedia.org/r/319635 (owner: 10Dzahn) [23:16:54] twentyafterfour: could you jump on tin and check it out? [23:17:03] thcipriani: ok [23:17:07] currently fetched for wmf.1, but not rebased.
[23:17:38] thcipriani: ok I see what it is, you just pulled the submodule bump to core [23:17:41] I think [23:17:59] because I just pulled the change into the submodule directly [23:18:13] ahh, yeah, gerrit does the magic bump for you in core [23:18:26] ok, so long as it's a no op [23:18:31] yep should be fine [23:18:37] twentyafterfour: thanks :) [23:18:41] * thcipriani continues SWAT [23:19:59] ottomata: your eventbus update is live on mw1099, check please [23:20:14] also, patch for wmf.23 not needed since wmf.1 is live everywhere [23:20:30] ok checking... [23:20:32] ok [23:21:19] thcipriani: looks good [23:21:27] ottomata: ok, going live [23:21:41] i can't repro the error logging handling live, but i can confirm that normal use works [23:21:49] cool, watching logs [23:21:52] for eventbus [23:23:33] !log thcipriani@tin Synchronized php-1.29.0-wmf.1/extensions/EventBus/EventBus.php: SWAT: [[gerrit:319661|Add logging and check for empty JSON encoded body (T148251)]] (duration: 00m 47s) [23:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:39] T148251: Empty body in EventBus request - https://phabricator.wikimedia.org/T148251 [23:23:39] ^ ottomata live [23:23:50] k [23:25:11] thcipriani: so far so good, errors have stopped coming into the eventbus service [23:25:17] waiting for logstash to capture the error logs from mw [23:25:22] :D nice [23:25:28] PROBLEM - Apache HTTP on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [23:25:41] might take a bit i guess then, 'cause those are jobs in the jobqueue [23:26:28] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.031 second response time [23:27:29] (03PS1) 10Catrope: Disable Flow on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319761 (https://phabricator.wikimedia.org/T148611) [23:27:55] (03CR) 10Catrope: [C: 04-2] "On hold until we finish exports" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319761 (https://phabricator.wikimedia.org/T148611) (owner: 10Catrope) [23:35:59] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2770555 (10madhuvishy) Thanks for the ticket @papaul. Helped me understand a bit of the history. What I understand so far - The ticket claim... [23:39:47] there should be no more crashes with grrrit-wm when you are updating the wikis :) [23:40:00] twentyafterfour fixed it, i've deployed it now ^^ :) [23:40:03] thcipriani ^^ [23:40:18] yay! thanks paladox [23:40:24] you're welcome :) [23:41:29] grrrit-wm: restart [23:41:29] re-connecting to gerrit [23:41:30] reconnected to gerrit [23:47:24] :) [23:54:37] PROBLEM - HHVM rendering on mw1230 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.001 second response time [23:55:37] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 72165 bytes in 0.417 second response time [23:58:24] (03PS1) 10Yuvipanda: tools: Properly give the clush user sudo rights [puppet] - 10https://gerrit.wikimedia.org/r/319767