[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160204T0000). Please do the needful.
[00:00:20] too busy today to handle it
[00:00:27] ah but I see no one signed any patches up for it anyway
[00:01:03] in lieu of normal swat, rolling wmf.12 to group1 wikis to get us back on schedule.
[00:01:57] w00t
[00:01:58] (CR) Subramanya Sastry: "From the logs:" [puppet] - https://gerrit.wikimedia.org/r/268326 (owner: Subramanya Sastry)
[00:02:15] (PS1) Thcipriani: group1 wikis to 1.27.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268327
[00:03:19] operations: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996363 (RobH) NEW a:RobH
[00:03:27] (CR) Thcipriani: [C: 2] group1 wikis to 1.27.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268327 (owner: Thcipriani)
[00:03:32] operations, ops-codfw, Salt, hardware-requests: allocate hardware for salt master in codfw - https://phabricator.wikimedia.org/T123559#1932440 (RobH)
[00:03:34] operations: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996373 (RobH)
[00:03:38] operations, Labs, Labs-Infrastructure, Tool-Labs: failed backups on labstore? - https://phabricator.wikimedia.org/T125749#1996375 (Dzahn) - /usr/local/bin/nrpe_check_systemd_unit_state on labstore is puppetized, but the check_replicate commands and NRPE config seems to be missing - check_systemd...
[00:03:53] (Merged) jenkins-bot: group1 wikis to 1.27.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/268327 (owner: Thcipriani)
[00:04:04] operations: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996363 (RobH)
[00:04:18] operations, Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996363 (RobH)
[00:04:23] !log thcipriani@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.12
[00:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:05:12] (PS1) Jdlrobson: Make mobile-beta an available platform [puppet] - https://gerrit.wikimedia.org/r/268329
[00:05:14] (PS2) Subramanya Sastry: parsoid-vd-client: Add missing PATH and PROXY env vars to script [puppet] - https://gerrit.wikimedia.org/r/268326
[00:05:48] operations, Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996385 (RobH)
[00:05:50] operations, Patch-For-Review: jessie installer fails after partitioning stage- same recipe works on trusty and a it worked few weeks ago - https://phabricator.wikimedia.org/T125256#1996386 (RobH)
[00:06:46] (CR) jenkins-bot: [V: -1] Make mobile-beta an available platform [puppet] - https://gerrit.wikimedia.org/r/268329 (owner: Jdlrobson)
[00:09:05] (PS2) Jdlrobson: Make mobile-beta an available platform [puppet] - https://gerrit.wikimedia.org/r/268329
[00:11:02] operations, Labs, Labs-Infrastructure, Tool-Labs: failed backups on labstore? - https://phabricator.wikimedia.org/T125749#1996400 (Dzahn) nevermind, the script actually just gets this from systemctl like this: /bin/systemctl show replicate-maps | grep Result Result=exit-code This is what is execu...
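The check Dzahn describes above boils down to reading the `Result=` property from `systemctl show <unit>`: a failed last run reports `Result=exit-code`, a successful one `Result=success`. A minimal self-contained sketch of that parsing logic (the real check on labstore calls systemctl directly; the function names here are hypothetical, and canned output stands in for the systemctl call):

```shell
# Parse the Result= property out of `systemctl show` output. A unit whose
# last run failed reports Result=exit-code; a clean run reports
# Result=success. We feed canned output so the sketch runs anywhere.
unit_result() {
  printf '%s\n' "$1" | grep '^Result=' | cut -d= -f2
}

check_unit() {
  if [ "$(unit_result "$1")" = "success" ]; then
    echo "OK - last run successful"
  else
    echo "CRITICAL - last run result: $(unit_result "$1")"
  fi
}

check_unit "Result=exit-code"   # what a failed replicate-maps run looks like
check_unit "Result=success"
```

This is why the ACKNOWLEDGEMENT alerts a few minutes later read "Last run result for unit replicate-maps was exit-code": the check simply surfaces that property.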
[00:11:36] operations, ops-codfw: apply hostname label and update racktables entry for sarin (WMF5851) - https://phabricator.wikimedia.org/T125753#1996402 (RobH) NEW a:Papaul
[00:11:42] operations, Labs, Labs-Infrastructure, Tool-Labs: labstore - replication to codfw broken or not working yet - https://phabricator.wikimedia.org/T125749#1996409 (Dzahn)
[00:12:24] ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code daniel_zahn https://phabricator.wikimedia.org/T125749
[00:12:24] ACKNOWLEDGEMENT - Last backup of the others filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-others was exit-code daniel_zahn https://phabricator.wikimedia.org/T125749
[00:12:24] ACKNOWLEDGEMENT - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code daniel_zahn https://phabricator.wikimedia.org/T125749
[00:13:16] (CR) Dzahn: [C: 2] "$::lsbdistcodename is all over the place" [puppet] - https://gerrit.wikimedia.org/r/266967 (owner: Dzahn)
[00:16:08] operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1996417 (Dzahn) we should check one more time since today OTRS moved to mendelevium. i think we already fixed everything except the DNSSEC and DANE part before though, as Jan Zerebecki s...
[00:19:36] operations, OTRS, HTTPS: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1996427 (Dzahn) The OTRS upgrade has happened. I think this can follow soon.
[00:20:00] operations, ops-codfw: onsite setup for sarin (WMF5851) - https://phabricator.wikimedia.org/T125753#1996428 (RobH)
[00:20:30] operations, Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996363 (RobH)
[00:21:17] operations, Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996363 (RobH) This system hasn't been allocated since moving from Tampa, so it needs its drac setup done for initial remote access. All of the onsite steps are detailed on T125753. Once t...
[00:25:43] operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1996450 (Dzahn) https://www.ssllabs.com/ssltest/analyze.html?d=ticket.wikimedia.org grade A @DaBPunkt i would like to claim this is resolved. All the main things you listed when this w...
[00:26:00] operations, HTTPS: Add Forward Secrecy to all HTTPS sites - https://phabricator.wikimedia.org/T55259#1996453 (Dzahn)
[00:26:03] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[00:26:03] operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1996451 (Dzahn) Open>Resolved a:Dzahn
[00:29:44] operations, Discovery, Maps, Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1996458 (Yurik) Benchmarking [[ https://ganglia.wikimedia.org/latest/?r=custom&cs=02%2F03%2F2016+23%3A57&ce=02%2F04%2F2016+00%3A04&m=cpu_report&c=Maps+Cluster+codfw&h=ma...
[00:35:31] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[00:41:51] !log yuvipanda@labstore2001:~$ sudo lvremove backup/tools20160121020007
[00:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:42:15] operations, MediaWiki-Configuration: search.wikimedia.org fails to load - https://phabricator.wikimedia.org/T125755#1996492 (Mattflaschen) NEW
[00:42:30] fyi about to do a deployment of a parser change revert
[00:42:36] should not affect group2 wikis
[00:42:40] !log yuvipanda@labstore2001:~$ sudo lvremove backup/maps20160121040005
[00:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:43:32] headsup, there's a phabricator maintenance window starting in 20 mins, during which iridium will be rebooted
[00:43:38] operations, Labs, Labs-Infrastructure, Tool-Labs: labstore - replication to codfw broken or not working yet - https://phabricator.wikimedia.org/T125749#1996500 (yuvipanda) Old snapshot on labstore2001 had gotten full, causing lvs to fail, causing the backup script to fail. I've cleaned them out on...
[00:44:18] operations: search.wikimedia.org fails to load - https://phabricator.wikimedia.org/T125755#1996501 (Krenair)
[00:44:22] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
[00:45:32] (PS2) Dzahn: ldap: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266969
[00:46:10] (CR) Dzahn: [C: 2] ldap: fix top-scope var without namespace [puppet] - https://gerrit.wikimedia.org/r/266969 (owner: Dzahn)
[01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160204T0100).
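The snapshots removed above embed their creation time in the name: backup/tools20160121020007 is 2016-01-21 02:00:07. A hedged sketch of how stale snapshots could be picked out by parsing that timestamp (the helper names are invented for illustration and are not the real cleanup script; no lvremove is invoked here):

```shell
# Snapshot names like tools20160121020007 end in a fixed-width
# YYYYMMDDHHMMSS stamp; strip the leading volume name to recover the day.
# Both helpers (snap_day, is_stale) are hypothetical names.
snap_day() {
  printf '%s\n' "$1" | sed 's/^[a-z-]*//' | cut -c1-8
}

is_stale() {
  # $1 = snapshot name, $2 = cutoff day (YYYYMMDD); a numeric compare is
  # valid because the stamp is fixed-width and big-endian
  if [ "$(snap_day "$1")" -lt "$2" ]; then
    echo stale
  else
    echo fresh
  fi
}

is_stale tools20160121020007 20160204   # older than the cutoff
is_stale maps20160204000000 20160204    # not older
```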
[01:00:35] !log rebooting iridium (phabricator host) for kernel update
[01:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:01:01] :)
[01:01:09] Could someone check if there are currently mail delivery issues? See https://phabricator.wikimedia.org/T125756 -- it seems mail sent through OTRS isn't getting through at the moment.
[01:01:53] !log krenair@mira Synchronized php-1.27.0-wmf.12/includes/parser: https://gerrit.wikimedia.org/r/#/c/268332/ (duration: 02m 25s)
[01:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:02:00] (CR) Mobrovac: [C: 1] cassandra: fix top-scope vars without namespaces [puppet] - https://gerrit.wikimedia.org/r/266975 (owner: Dzahn)
[01:03:39] twentyafterfour: are you updating phab or did it just decide to take a nap?
[01:03:51] pajz: cant look at the link right now, but there has been an OTRS upgrade and server switch today
[01:03:56] * yurik switches to github
[01:03:57] bd808: moritz just restarted
[01:04:07] bd808: kernel update
[01:04:10] ah I see the !log now
[01:04:17] phabricator is back up
[01:04:52] !log krenair@mira Synchronized php-1.27.0-wmf.12/tests: https://gerrit.wikimedia.org/r/#/c/268332/ (duration: 02m 08s)
[01:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:05:22] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[01:05:31] heh
[01:05:51] iridium-vcs.. iridium is phab
[01:06:13] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[01:06:31] mutante, hmm, yes, though it has worked after the upgrade at some point.
[01:07:11] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy
[01:08:01] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[01:08:12] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd)
[01:08:23] pajz: looking, i do see still things happening in the mail logs thre
[01:08:44] fixed phd
[01:08:55] thanks, got SMS
[01:09:34] pajz: could you maybe try to send one now?
[01:09:47] yeah, sec
[01:10:22] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Puppet has 1 failures
[01:10:50] mutante, same error
[01:11:33] pajz: i see activity.. and lots of mail being detected as spam and then being discarded
[01:11:54] but maybe not yours
[01:12:30] outbound mail detected as spam?
[01:13:14] just mail in general
[01:13:19] i found something else thoug
[01:13:33] pajz: one more time please?
[01:14:30] pajz: was that to a "martin"?
[01:14:46] same error
[01:15:09] no, to 'patrik@...'
[01:15:28] uh, odd. i have an error but that doesnt match it
[01:15:39] Message: 'Ticket create notification' notification could not be sent to agent
[01:16:01] from info-de-v@wikimedia.org to my private email address
[01:16:12] operations: search.wikimedia.org fails to load - https://phabricator.wikimedia.org/T125755#1996587 (Krenair) Open>Invalid a:Krenair Yeah, I think this is what it's supposed to do - https://en.wikipedia.org/w/api.php?action=opensearch&search=&limit=99 returns no results. https://search.wikimedia.org/?...
[01:17:13] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 17 processes with UID = 997 (phd)
[01:20:30] ^not sure why this happened, I looked and other than the slew of normal phd errors for repo processing
[01:20:31] seems ok
[01:23:37] chasemp: it happened because puppet first run after boot screws up my scap3 deployment tags. We need a way to make puppet only check out the tags if the repos don't exist and otherwise leave well enough alone.
[01:24:02] ok thanks for the explanation I was puzzled
[01:24:30] the same thing happened last time iridium rebooted. The thing that I don't understand is, why it happens on first puppet run after boot?
[01:24:46] Is the puppet lock file on tmpfs?
[01:27:46] * twentyafterfour bets that's what it is
[01:27:52] idk it runs via cron or @reboot but I think teh the logic is in puppet-run
[01:28:00] in /usr/local/bin if you want to look
[01:43:04] (PS1) 20after4: Put the phabricator repo lock file on persistent storage [puppet] - https://gerrit.wikimedia.org/r/268340
[01:44:51] (PS1) 20after4: Add missing & (typo?) [puppet] - https://gerrit.wikimedia.org/r/268341
[01:46:10] chasemp: it's the phab_repo_lock file, specified in the phabricator.pp role... it's on /var/run which is not persistent storage.
[01:46:24] see ^^ patch to fix it
[01:47:10] 12
[01:55:02] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:58:25] (PS1) Yuvipanda: tools: Replace php5-mysql with php5-mysqlnd [puppet] - https://gerrit.wikimedia.org/r/268342 (https://phabricator.wikimedia.org/T125758)
[01:59:07] mutante: should we have agents stop sending emails until this is fixed as I think these ones that are failing will just likely go unnoticed later when things are fixed - unless you think they are queued up and will send later but I don't see that..
[01:59:10] :-\
[02:00:22] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful
[02:01:13] RD: yes, we should have them stop sending mails, they wont be queue up because OTRS fails to give them to the mail server
[02:01:36] would you mind updating the MOTD in the system?
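The failure mode diagnosed above: the phab_repo_lock file lives under /var/run, a tmpfs that is wiped at every reboot, so the next puppet run looks like a first run and re-checks-out the deployment tags. A minimal sketch of such a run-once guard (the function name and marker path are illustrative, not the actual puppet-run code):

```shell
# Run-once guard: a marker file decides whether this counts as a first
# run. If the marker lives on tmpfs (/var/run) it vanishes at reboot and
# the "first run" branch fires again; on persistent storage it fires once.
ensure_tags() {
  marker="$1"
  if [ ! -e "$marker" ]; then
    touch "$marker"
    echo "first run: checking out deployment tags"
  else
    echo "marker present: leaving working copy alone"
  fi
}

marker=$(mktemp -u)      # stand-in for a persistent marker path
ensure_tags "$marker"    # first run: performs the checkout step
ensure_tags "$marker"    # subsequent run: skips it
rm -f "$marker"
```

Moving the marker off tmpfs, as the patch below does, is what turns the guard into a true run-once.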
[02:02:19] RD: unless you mean the server's motd i dont know how or where
[02:02:26] The "News" messages on the main OTRS login screen can be editing by modifying /opt/otrs/Kernel/Output/HTML/Standard/Motd.dtl.
[02:02:31] according to https://wikitech.wikimedia.org/wiki/OTRS
[02:02:35] i have never logged in on OTRS before
[02:02:45] (CR) Yuvipanda: [C: 2 V: 2] tools: Replace php5-mysql with php5-mysqlnd [puppet] - https://gerrit.wikimedia.org/r/268342 (https://phabricator.wikimedia.org/T125758) (owner: Yuvipanda)
[02:02:47] we as OTRS admins cannot do it
[02:02:52] it's on the server end
[02:03:03] I can give you the text if you can do it
[02:03:36] hold on
[02:04:54] i _think_ i found it
[02:05:07] RD: what is the text
[02:06:23] mutante: one moment
[02:08:45] mutante: The operations team is aware of issues caused by the recent upgrade to OTRS 5. Until further notice please avoid sending messages as they may not be delivered. For more information and updates see OTRS wiki.
[02:10:15] RD: ok, did it change ?
[02:10:51] checking this is not being reverted by puppet
[02:11:02] i have to turn it on on the admin end
[02:11:12] we can turn it on, but not edit it lol
[02:11:16] i did edit the template file
[02:12:03] mutante: Yep - I turned them on and they are now present on login screen and dashboard
[02:12:08] !log OTRS - changed motd message in /opt/otrs/Kernel/Output/HTML/Templates/Standard/Motd.tt - admins can turn it on and off
[02:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:12:27] RD: ok, good
[02:12:29] I wish we could edit that file
[02:12:30] Thanks a lot
[02:13:03] maybe we can puppetize it, then you could upload changes (but still need somebody to merge)
[02:13:56] RD: has someone suggested that upstream to the OTRS team?
[02:14:15] most people don't use the system like we do
[02:14:18] But I'd have to look into it
[02:14:28] We can talk about it when we get everything fixed :)
[02:20:12] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[02:20:51] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[02:22:41] (CR) Mattflaschen: [C: 1] Add echo tables to the list of private tables [puppet] - https://gerrit.wikimedia.org/r/268060 (https://phabricator.wikimedia.org/T125591) (owner: Jcrespo)
[02:23:51] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[02:24:23] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[03:00:32] RECOVERY - Last backup of the others filesystem on labstore1001 is OK: OK - Last run for unit replicate-others was successful
[03:14:11] (Abandoned) BBlack: cache_parsoid: use local backends in codfw [puppet] - https://gerrit.wikimedia.org/r/266489 (owner: BBlack)
[03:24:30] (PS2) Tim Landscheidt: Tools: Switch portgrabber and portreleaser to proxymanager [puppet] - https://gerrit.wikimedia.org/r/268279
[03:24:32] (PS3) Tim Landscheidt: Tools: Allow proxymanager to add and remove proxy forward entries [puppet] - https://gerrit.wikimedia.org/r/266448
[03:24:34] (PS1) Tim Landscheidt: Tools: Decommission proxylistener [puppet] - https://gerrit.wikimedia.org/r/268346
[03:25:47] (CR) Tim Landscheidt: [C: -1] "Depends on I717c8d220625971b169e7a578500e89c69545d74 being deployed." [puppet] - https://gerrit.wikimedia.org/r/268346 (owner: Tim Landscheidt)
[03:28:40] (PS1) Tim Landscheidt: Tools: Remove obsolete code [puppet] - https://gerrit.wikimedia.org/r/268347
[04:01:11] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful
[05:37:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:37:27] critical critical 22 22
[05:37:32] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[05:38:19] Tyger, tyger, burning bright, in the forests of the night, what immortal hand or eye could frame thy fearful symmetry?
[05:40:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 28885 seconds ago, expected 28800
[05:44:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:44:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:45:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 29185 seconds ago, expected 28800
[05:50:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 29485 seconds ago, expected 28800
[05:55:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 29785 seconds ago, expected 28800
[06:00:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 30085 seconds ago, expected 28800
[06:05:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 30385 seconds ago, expected 28800
[06:10:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 30685 seconds ago, expected 28800
[06:15:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 30985 seconds ago, expected 28800
[06:18:03] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[06:20:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 31285 seconds ago, expected 28800
[06:23:22] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[06:25:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 31585 seconds ago, expected 28800
[06:28:38] phabricator down for everyone or just me?
[06:29:42] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: puppet fail
[06:30:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 31885 seconds ago, expected 28800
[06:30:22] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:42] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:01] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:22] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 1017 bytes in 0.092 second response time
[06:31:41] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:52] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:03] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:18] Nikerabbit: PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikimedia and MediaWiki not found on …
[06:33:22] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:59] From phab: >>> UNRECOVERABLE FATAL ERROR <<< Call to undefined method AlmanacCreateClusterServicesCapability::getPhobjectClassConstant()
[06:34:02] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 32185 seconds ago, expected 28800
[06:37:01] p858snake: yes my colleaugues confirmed
[06:38:40] Is phab down?
[06:38:54] I think so
[06:38:58] >>> UNRECOVERABLE FATAL ERROR <<< Call to undefined method AlmanacCreateClusterServicesCapability::getPhobjectClassConstant()
[06:38:59] Luke081515: SyntaxError: Unexpected identifier
[06:38:59] >>> UNRECOVERABLE FATAL ERROR <<<
[06:38:59] Call to undefined method AlmanacCreateClusterServicesCapability::getPhobjectClassConstant()
[06:38:59] /srv/phab/phabricator/src/applications/policy/capability/PhabricatorPolicyCapability.php:18
[06:38:59] ┻━┻ ︵ ¯\_(ツ)_/¯ ︵ ┻━┻
[06:39:00] zhuyifei1999_: SyntaxError: Unexpected identifier
[06:40:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 32485 seconds ago, expected 28800
[06:40:40] PHP Fatal error: Call to undefined function phutil_json_encode() in /srv/phab/phabricator/src/infrastructure/storage/lisk/LiskDAO.php on line 1654
[06:40:42] from the logs
[06:40:44] no clue
[06:40:48] hrmm
[06:40:57] should we perhaps wake one of our phab admins, like twentyafterfour?
[06:41:12] or is phabricator downtime not something to wake folks over?
[06:41:14] the host was rebooted to get a kernel update earlier
[06:41:26] it worked post reboot though right?
[06:41:35] yes, for hours
[06:41:40] twentyafterfour is actually probably awake I would guess
[06:41:44] '/srv/phab/phabricator/bin/repository' update -- 'OPUP'
[06:41:55] there are a bunch of these, with different 4 letter names
[06:41:58] all failing
[06:42:21] this is where I wanna pass the buck only because i'm stil pretty sleepy
[06:42:48] I know it's almost 9 but I'm waking up too early these days and going to bed too late
[06:42:51] elukey: you awake and about yet?
[06:43:23] its past midnight in twentyafterfour's tz so im not sure if phabricator downtime is wakable or not for folks outside of ops
[06:43:28] ah
[06:43:42] not sf?
[06:43:48] apergos: did you get paged or just saw it go down?
[06:43:53] if you got paged then all the eu folks did.
[06:43:55] Nope, mid-west
[06:44:09] paged
[06:45:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 32785 seconds ago, expected 28800
[06:45:11] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: puppet fail
[06:46:04] it's cycling through this same list of pulls over and over
[06:47:54] Well, im not sure how to fix it but I'm also not sure if this is an outage where I should be calling folks =/
[06:49:32] !log phabricator down with errors during repo updates in phd daemon log
[06:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:50:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 33085 seconds ago, expected 28800
[06:50:43] so I see a bunch of interested stuff from a puppet run after the reboot,
[06:51:07] in the syslog with a timestamp of Feb 4 01:04:33
[06:52:23] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:52:27] but it didnt die until well after the reboot
[06:52:42] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:52:46] yeah I know
[06:53:10] I don't know what normal behavior is so I don't know if those runs are ignoreable or not
[06:54:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:55:12] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 33385 seconds ago, expected 28800
[06:55:52] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:56:23] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:56:42] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:43] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:56:51] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:11] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:57:12] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:52] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:01] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:00:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 33685 seconds ago, expected 28800
[07:00:22] !log on iridium in /srv/deployment/phabricator/deploy/phabricator, naming the currently detached git branch ‘andrewfounditlikethis'
[07:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:01:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:05:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 33985 seconds ago, expected 28800
[07:10:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 34286 seconds ago, expected 28800
[07:12:01] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:15:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 34585 seconds ago, expected 28800
[07:19:16] twentyafterfour is being called, as we've spent a bit of time on this without real progress
[07:19:16] well, without any progress, tbh
[07:19:32] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:20:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 34886 seconds ago, expected 28800
[07:25:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 35186 seconds ago, expected 28800
[07:28:02] apergos: I'm here
[07:28:11] thank you
[07:28:17] we're chattingin _security
[07:30:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 35485 seconds ago, expected 28800
[07:32:12] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 23111 bytes in 0.173 second response time
[07:32:40] ---^ \o/
[07:32:52] thank you
[07:33:17] operations, OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#1996641 (Dzahn) also //"otrs.PostMaster.pl is deprecated, please use console command 'Maint::PostMaster::Read' instead." //
[07:34:44] operations, MediaWiki-General-or-Unknown, MediaWiki-Maintenance-scripts, Upstream: PHP 5.5.9 seems to have issues parsing some argumentless command line parameters - https://phabricator.wikimedia.org/T125748#1996646 (Smalyshev) Docs for getopt say: ``` The parsing of options will end at the first...
[07:35:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 35785 seconds ago, expected 28800
[07:35:16] !log disabling puppet on iridium to prevent it from smashing phabricator (as it seems to do now and then)
[07:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:35:27] ^I did it
[07:37:35] operations, OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#1996648 (Dzahn)
[07:40:03] operations, OTRS, Patch-For-Review, user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1996652 (Dzahn) @Akosiaris please see T125756 some outgoing notification mail fails, and i found [Error][Kernel::System::Email::Sendmail::Send][Line:85]: Can't se...
[07:40:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 36085 seconds ago, expected 28800 [07:45:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 36385 seconds ago, expected 28800 [07:49:56] !log git checkout tag release/2015-11-18/1 for phab & libphutil on iridiuum [07:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:50:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 36686 seconds ago, expected 28800 [07:51:39] !log phabricator repositories checked out to these revisions: http://pastebin.com/JxEaYKiW [07:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:55:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 36985 seconds ago, expected 28800 [07:59:49] (03PS1) 10Rush: phabricator: forward the old tag system to current release/2015-11-18/1 [puppet] - 10https://gerrit.wikimedia.org/r/268351 [08:00:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 37285 seconds ago, expected 28800 [08:05:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 37585 seconds ago, expected 28800 [08:05:43] (03PS2) 10Rush: phabricator: forward the old tag system to current release/2015-11-18/1 [puppet] - 10https://gerrit.wikimedia.org/r/268351 [08:10:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 37885 seconds ago, expected 28800 [08:14:56] !log iridium puppet agent --enable && puppet agent --disable "DO NO ENABLE AS IT WILL BREAK THINGS CONTACT MUKUNDA" [08:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:15:04] 6operations, 10OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#1996692 (10Dzahn) RD (one of the OTRS admins) asked me to edit the motd template to announce the current issues. 
I found it in `/opt/otrs/Kernel/Output/HTML/Templates/Standard/Motd.tt` and edited the templa... [08:15:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 38185 seconds ago, expected 28800 [08:20:12] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 38485 seconds ago, expected 28800 [08:23:47] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1996707 (10mobrovac) [08:25:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 38785 seconds ago, expected 28800 [08:25:46] <_joe_> wtf americium [08:30:11] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet last ran 39086 seconds ago, expected 28800 [08:38:18] 6operations, 10Traffic, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1997024 (10Joe) Ok I think I pinned down a case where this can happen: whenever we delete a directory, what happens is that etcd sends down the wire ``` {"action":"delete","node":{"key":"/testdir","dir"... [08:39:11] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1996717 (10Bianjiang) With T78676 and other related efforts (separating content into different APIs), 50 req/s global limit (limit by UserAgent?) is not enough even for regular incremental craw... [08:51:11] 6operations, 10Traffic, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1997048 (10Joe) That happens more in general when we remove a node. 
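Joe's diagnosis above is that the pybal etcd watcher assumed every event was a "set" carrying a value, so the `{"action":"delete",...}` event that etcd sends when a node or directory is removed crashed the coroutine. A minimal sketch of dispatching on the action field instead — the pool structure and function name here are illustrative, not pybal's actual code:

```python
import json

def apply_etcd_event(pool, raw_event):
    """Update an in-memory server pool from one etcd watch event.

    Handles both "set"-style and "delete"-style actions; the bug
    described above was assuming every event carries a "value".
    """
    event = json.loads(raw_event)
    action = event.get("action")
    node = event.get("node", {})
    key = node.get("key")
    if action in ("set", "create", "update"):
        pool[key] = json.loads(node["value"])
    elif action in ("delete", "expire"):
        # A deleted node (or directory) has no "value" field at all.
        pool.pop(key, None)
    return pool

pool = {}
apply_etcd_event(pool, '{"action":"set","node":{"key":"/pool/mw1001","value":"{\\"weight\\": 10}"}}')
apply_etcd_event(pool, '{"action":"delete","node":{"key":"/pool/mw1001","dir":false}}')
```

The key point is the `elif` branch: a watcher that only knows how to parse values raises as soon as the first removal arrives.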
[08:54:17] morning [08:56:03] apparently we have got 1.27.0-wmf.12 on group 1 \o/ [09:07:43] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1996743 (10GWicke) 5declined>3Open [09:10:59] !log converting remaining InnoDB tables (s3) to TokuDB on db1069 [09:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:23:00] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1996757 (10GWicke) Reopening, reflecting the ongoing discussion. [09:29:02] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: db1069 is running low on space - https://phabricator.wikimedia.org/T124464#1997110 (10jcrespo) p:5Triage>3Normal [09:30:55] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: db1069 is running low on space - https://phabricator.wikimedia.org/T124464#1997115 (10jcrespo) There were many innodb tables there, but mostly not on purpose, as the come from imports/backup recovery/alters/new deployment. Doing a (slow) batch conversion of... 
[09:34:02] (03PS2) 10Jcrespo: Add echo tables to the list of private tables [puppet] - 10https://gerrit.wikimedia.org/r/268060 (https://phabricator.wikimedia.org/T125591) [09:34:16] (03PS1) 10ArielGlenn: dumps: skip labtestwiki for addschanged and pagetitles dumps [puppet] - 10https://gerrit.wikimedia.org/r/268356 [09:34:53] !log depooling restbase2004 for kernel reboot/Java update [09:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:12] !log re-enabling puppet on mw1161 [09:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:39:32] (03CR) 10ArielGlenn: [C: 032] dumps: skip labtestwiki for addschanged and pagetitles dumps [puppet] - 10https://gerrit.wikimedia.org/r/268356 (owner: 10ArielGlenn) [09:39:37] (03CR) 10Jcrespo: [C: 032] "As far as I can see, this is only used by sanitarium, so I will go on an deploy it." [puppet] - 10https://gerrit.wikimedia.org/r/268060 (https://phabricator.wikimedia.org/T125591) (owner: 10Jcrespo) [09:39:42] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [09:39:45] (03PS3) 10Jcrespo: Add echo tables to the list of private tables [puppet] - 10https://gerrit.wikimedia.org/r/268060 (https://phabricator.wikimedia.org/T125591) [09:41:24] 6operations, 7Monitoring: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#1997130 (10fgiunchedi) not seeing huge variations in http://graphite.wmflabs.org/render/?width=593&height=355&_salt=1454516355.421&from=-2days&target=monitoring.filippo-test-trusty.cpu.total.guest... [09:41:32] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:42:12] (03CR) 10Joal: "Yes ottomata, let's do that !" 
[puppet] - 10https://gerrit.wikimedia.org/r/267924 (https://phabricator.wikimedia.org/T124947) (owner: 10Eevans) [09:46:42] !log repooling restbase2004, depooling restbase2005 for kernel reboot/Java update [09:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:45] !log elastic in codfw: reducing the number of replicas from 0-3 to 0-2 for commonswiki_file [09:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:51:28] (03PS1) 10Filippo Giunchedi: diamond: send labs instance metrics via graphite/carbon [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) [09:52:39] (03CR) 10jenkins-bot: [V: 04-1] diamond: send labs instance metrics via graphite/carbon [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [09:56:08] (03PS2) 10Filippo Giunchedi: diamond: send labs instance metrics via graphite/carbon [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) [09:56:27] (03PS1) 10Giuseppe Lavagetto: Add tests for etcd [debs/pybal] - 10https://gerrit.wikimedia.org/r/268361 (https://phabricator.wikimedia.org/T125397) [09:57:08] !log repooling restbase2005, depooling restbase2006 for kernel reboot/Java update [09:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:58:41] (03CR) 10jenkins-bot: [V: 04-1] Add tests for etcd [debs/pybal] - 10https://gerrit.wikimedia.org/r/268361 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [10:01:38] !log applying live on the 7 sanitarium instance the newly puppet-configured labs replication filters [10:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:01] (03CR) 10DCausse: [C: 031] "I've set manually commonswiki_file to 0-2 today, we needed to restart the cluster so I prefered to have a green 
cluster during restarts." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658 (owner: 10EBernhardson) [10:07:10] 6operations, 10OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#1997171 (10akosiaris) Looking at https://ganglia.wikimedia.org/latest/graph.php?h=mendelevium.eqiad.wmnet&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=2%2F3%2F2016%2014%3A21&ce=2%2F4%2F2016%208%3A2&st=145458... [10:10:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 671 [10:11:07] !log repooling restbase2006 [10:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:13:57] !log running smartctl -t long on kafka1012 (kafka not running, host de-pooled from the broker list) [10:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:14:21] !log testing new replication filters from production's testwiki [10:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:56] 6operations, 10OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#1997180 (10akosiaris) 5Open>3stalled [10:16:35] !log rolling reboot of maps cluster for kernel update [10:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:20:11] RECOVERY - check_mysql on db1008 is OK: Uptime: 1363314 Threads: 2 Questions: 8196840 Slow queries: 9190 Opens: 3153 Flush tables: 2 Open tables: 399 Queries per second avg: 6.012 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:24:02] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1997195 (10elukey) [10:31:19] Is there already a phabricator outage report? 
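The check_mysql SLOW_SLAVE alert and recovery above classify a replica by its Seconds_Behind_Master value against fixed thresholds. A minimal sketch of that classification, assuming illustrative warning/critical thresholds rather than the production ones:

```python
def classify_lag(seconds_behind, warn=60, crit=300):
    """Map replication lag to a Nagios-style state string.

    seconds_behind is None when the replication threads are stopped
    (SHOW SLAVE STATUS reports NULL). warn/crit are assumed values,
    not the thresholds used in production.
    """
    if seconds_behind is None:
        return "CRITICAL: replication stopped"
    if seconds_behind >= crit:
        return "CRITICAL: Seconds_Behind_Master: %d" % seconds_behind
    if seconds_behind >= warn:
        return "WARNING: Seconds_Behind_Master: %d" % seconds_behind
    return "OK: Seconds_Behind_Master: %d" % seconds_behind
```

With these thresholds the 671-second lag seen on db1008 maps to CRITICAL and the later 0-second reading back to OK, matching the alert/recovery pair in the log.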
[10:31:31] !sal [10:31:31] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [10:32:23] Luke081515: recently created issue https://phabricator.wikimedia.org/maniphest/query/1ftUfQtJOf_d/#R [10:32:28] 6operations, 10DBA, 6Labs, 5Patch-For-Review: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1997200 (10jcrespo) Filters are both puppetized and applied live on sanitarium/labs. They have been tested to work succesfully. Now I have to delete all echo tables from there. [10:33:21] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1997206 (10elukey) Hello! Following https://wikitech.wikimedia.org/wiki/Ops_Clinic_Duty#Access_requests t... [10:34:59] !log hashar@mira Synchronized php-1.27.0-wmf.12/.gitmodules: Set branch in .gitmodules for extensions/Wikidata https://gerrit.wikimedia.org/r/#/c/268218/ (duration: 02m 08s) [10:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:35:49] !log elastic codfw: freezing writes and setting cluster.routing.allocation.balance.threshold to 100% (fast recovery test) [10:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:17] hashar: Not everything at phab is like yesterday... phab is using the normal favicon at the moment, not the WMF one [10:40:34] 6operations, 10OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#1997253 (10akosiaris) >>! In T125756#1996692, @Dzahn wrote: > RD (one of the OTRS admins) asked me to edit the motd template to announce the current issues. > > I found it in `/opt/otrs/Kernel/Output/HTML/Tem... 
[10:42:09] Luke081515: I am not sure what happened to be honest [10:48:30] 6operations, 10OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#1997272 (10akosiaris) Memory usage is still at 2.6G, seems to be increasing https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=mendelevium.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report... [10:52:09] (03PS2) 10Giuseppe Lavagetto: Allow the etcd driver to handle deleted or inactive nodes [debs/pybal] - 10https://gerrit.wikimedia.org/r/268361 (https://phabricator.wikimedia.org/T125397) [10:52:35] <_joe_> ema: ^^ this should fix the pybal issue [10:53:13] (03CR) 10jenkins-bot: [V: 04-1] Allow the etcd driver to handle deleted or inactive nodes [debs/pybal] - 10https://gerrit.wikimedia.org/r/268361 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [11:04:31] _joe_: thanks [11:05:15] is there a missing import though? "global name 'json' is not defined" [11:16:30] (03PS1) 10Alexandros Kosiaris: otrs: PostMaster.pl is deprecated, replace it with Console.pl [puppet] - 10https://gerrit.wikimedia.org/r/268369 [11:17:58] <_joe_> ema: yes I am fixing that [11:18:14] <_joe_> PERL ALERT! 
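The elastic codfw !log entries here temporarily raise `cluster.routing.allocation.balance.threshold` so the cluster stops rebalancing shards during restarts, then set it back. A sketch of the request such a change sends to Elasticsearch's cluster settings API; using a `transient` setting (which reverts on full cluster restart) is an assumption about how it was applied, and the endpoint shown is the standard one:

```python
import json

def balance_threshold_request(value):
    """Build path and body for a transient cluster-settings update.

    A transient setting does not survive a full cluster restart,
    which suits a deliberately temporary change like this one.
    """
    body = {
        "transient": {
            "cluster.routing.allocation.balance.threshold": value
        }
    }
    return "/_cluster/settings", json.dumps(body)

# This would be PUT to the cluster, e.g.:
#   requests.put(cluster_url + path, data=body)
# where cluster_url is whatever host serves the codfw cluster (assumed).
```

Setting the threshold very high effectively pauses balancing; restoring the default (1) re-enables it, as in the later "resuming writes" log entry.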
[11:18:57] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: PostMaster.pl is deprecated, replace it with Console.pl [puppet] - 10https://gerrit.wikimedia.org/r/268369 (owner: 10Alexandros Kosiaris) [11:18:58] !log elastic codfw: resuming writes and setting cluster.routing.allocation.balance.threshold back to default (1%) [11:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:19:12] (03PS2) 10Alexandros Kosiaris: otrs: redirect iodine to the test database [puppet] - 10https://gerrit.wikimedia.org/r/268098 [11:19:18] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: redirect iodine to the test database [puppet] - 10https://gerrit.wikimedia.org/r/268098 (owner: 10Alexandros Kosiaris) [11:19:34] (03PS2) 10Alexandros Kosiaris: etherpad: Be pedantic about defaultPadText [puppet] - 10https://gerrit.wikimedia.org/r/267866 [11:19:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] etherpad: Be pedantic about defaultPadText [puppet] - 10https://gerrit.wikimedia.org/r/267866 (owner: 10Alexandros Kosiaris) [11:21:24] and of course I also found a race in puppet-merge... 
sigh [11:21:33] (03PS3) 10Filippo Giunchedi: uwsgi: create /run/uwsgi, including at boot [puppet] - 10https://gerrit.wikimedia.org/r/268126 [11:21:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] uwsgi: create /run/uwsgi, including at boot [puppet] - 10https://gerrit.wikimedia.org/r/268126 (owner: 10Filippo Giunchedi) [11:21:46] 6operations, 10DBA, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1997343 (10jcrespo) [11:21:49] 6operations, 10DBA, 10MediaWiki-Special-pages, 10Wikidata, 7Performance: Batch updates create slave lag on s3 over WAN - https://phabricator.wikimedia.org/T122429#1997342 (10jcrespo) [11:22:41] 6operations, 10DBA, 10MediaWiki-Special-pages, 10Wikidata, 7Performance: Batch updates create slave lag on s3 over WAN - https://phabricator.wikimedia.org/T122429#1904041 (10jcrespo) [11:22:44] 6operations, 10DBA, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1542524 (10jcrespo) [11:23:22] 6operations, 10DBA, 10MediaWiki-Special-pages, 10Wikidata, 7Performance: Batch updates create slave lag on s3 over WAN - https://phabricator.wikimedia.org/T122429#1904041 (10jcrespo) [11:24:32] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1997364 (10elukey) We keep receiving emails from smartd related to failures for other drives on kafka1012, so I tried to run smartctl -l short tests for two disk and didn't find any... [11:27:13] 6operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#1997393 (10MoritzMuehlenhoff) I looked into this for a bit, but couldn't really pin this down cp3042/cp3049 have had the same crash in RCU handling. This _might_ have been fixed by this commit wh... 
[11:31:00] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1997400 (10elukey) Addendum: the new user will be added to the groups only after TechOps meeting review, t... [11:31:41] (03PS3) 10Giuseppe Lavagetto: Allow the etcd driver to handle deleted or inactive nodes [debs/pybal] - 10https://gerrit.wikimedia.org/r/268361 (https://phabricator.wikimedia.org/T125397) [11:32:20] 6operations, 7Availability, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#1997406 (10fgiunchedi) 3NEW a:3aaron [11:32:50] 6operations, 10MediaWiki-Interface, 5MW-1.27-release, 5MW-1.27-release-notes, and 4 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1997415 (10Danny_B) Could some... 
[11:32:58] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1997417 (10elukey) p:5Triage>3Normal [11:34:51] 6operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 5 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1997420 (10Joe) [11:36:26] (03PS1) 10Filippo Giunchedi: swiftrepl: add debian packaging [software] - 10https://gerrit.wikimedia.org/r/268371 [11:37:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: add debian packaging [software] - 10https://gerrit.wikimedia.org/r/268371 (owner: 10Filippo Giunchedi) [11:38:34] (03PS2) 10JanZerebecki: Update WikidataBuildResources git source (github -> gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [11:39:46] (03CR) 10JanZerebecki: [C: 031] "Good to merge. I took care of the one node that uses this." [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [11:44:57] !log dropping echo_* tables from labs [11:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:28] 6operations, 10DBA, 6Labs, 5Patch-For-Review: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1997494 (10jcrespo) Dropping is ongoing, it will take some time as I do not want to affect labs' replication lag. [12:08:31] !log rebooting db2001 to db2019 for kernel update [12:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:11] and the pages start [12:10:29] jynus: I marked these at downtime in icinga, though? 
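Marking hosts as in downtime, when not done through the Icinga web UI, goes through the Nagios/Icinga external command pipe using a fixed semicolon-delimited format. A sketch of building a SCHEDULE_HOST_DOWNTIME line (the pipe path in the comment and the example arguments are illustrative):

```python
import time

def schedule_host_downtime(host, duration_s, author, comment, start=None):
    """Format a SCHEDULE_HOST_DOWNTIME external command line.

    Format: [timestamp] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;
    trigger_id;duration;author;comment -- fixed=1 means the downtime
    runs exactly from start to end.
    """
    now = int(start if start is not None else time.time())
    end = now + duration_s
    return "[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s" % (
        now, host, now, end, duration_s, author, comment)

# The line is written to the daemon's command pipe, e.g.
# /var/lib/icinga/rw/icinga.cmd (actual path depends on the install).
```

If the downtime window is set correctly, reboots like the db20xx batch above page nobody; replication checks on hosts downstream of a rebooted master, as noted just below, can still fire unless they are downtimed too.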
[12:10:59] well, there is a replication topology, if you kill the master, the slaves complain [12:11:41] graphite may complain too [12:21:03] !log starting mysql at db2009 [12:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:23:16] PROBLEM - MariaDB Slave Lag: s7 on db1034 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 341 [12:23:49] jynus: ? ^ I suppose that's you ^ ? [12:24:18] no [12:24:27] db1034 has problems [12:25:19] I will depool it if I can [12:26:48] RECOVERY - MariaDB Slave Lag: s7 on db1034 is OK: OK slave_sql_lag Seconds_Behind_Master: 1 [12:27:15] mark I got some hard numbers on the maps cluster performance, let me know what you think. https://phabricator.wikimedia.org/T125126 [12:27:35] (03PS1) 10Jcrespo: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268377 [12:28:09] <_joe_> every time we say "hard" numbers (I do that too), I'm left wondering what "soft numbers" would mean :P [12:28:20] I do not know if to proceed, it seems to have gone, I am going to wait until seing the cause [12:30:44] I see the cause [12:31:16] (03Abandoned) 10Jcrespo: Depool db1034 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268377 (owner: 10Jcrespo) [12:37:50] _joe_, soft numbers are like soft tabs - they fill up all of the available budget [12:38:47] ApiQueryContributions::execute was going out of hand because the long queries kill process had died [12:52:44] moritzm, if you want to update other maps servers, go ahead - i will update tilerator in 5 hours during the deployment window. 
Its not needed at the moment [12:54:00] moritzm, you can disable tilerator & tileratorui services on all machines - i will reenable them after deployment [12:54:21] yurik: ok, will continue with the maps200[2-4] in a bit [12:55:30] yurik: if they're doing no harm atm, we can keep the tilerator* services as-is, the icinga check for those is silenced anyway [12:56:45] moritzm, sure [12:58:14] moritzm, i wonder if logs are being flooded... than again, logs have been very weird with the tilerator and kartotherian - it seems something is broken and no new files are being created in /var/log/tilerator [12:58:29] might be something in logstash [12:59:26] moritzm, yes, logstash is being flooded with restarts, might as well kill them [13:04:25] yurik: ok, I've stopped tileratorui.service and tilerator.service on maps-test2001 (and next on the other three) [13:15:45] Hello there ops people. I'm having a jenkins issue where node Promise is undefined https://integration.wikimedia.org/ci/job/npm/50198/console what version of node is that thing running? [13:16:22] 6operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 5 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1997604 (10Krenair) [13:16:30] <_joe_> jan_drewniak: I guess people would know in #wikimedia-releng [13:17:10] _joe_ thanks! [13:21:20] jan_drewniak: from the full output https://integration.wikimedia.org/ci/job/npm/50198/consoleFull [13:21:29] jan_drewniak: node --version -> v0.10.25 [13:21:55] will probably migrate all to node 4.2 next week [13:22:40] hashar: that sounds great! 0.10.25 is pretty old. 
[13:23:47] jan_drewniak: yup and the services team had nodejs 4.2.4 back ported for our Debian Jessie system [13:24:04] jan_drewniak: we "just" have to migrate the CI job from Trusty slaves to Jessie and hopefully it will work :D [13:24:12] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0] [13:26:01] 6operations, 6Services: Update node_js to latest 0.10.x release - https://phabricator.wikimedia.org/T119218#1997630 (10hashar) 5Open>3declined a:3hashar We are moving to Jessie and a backport of Nodejs 4.2.4. Example: {T124989} The child task for CI is {T119143} [13:26:47] jan_drewniak: you can subscribe to https://phabricator.wikimedia.org/T119143 which tracks npm CI job migration to Jessie and as a consequence 4.2.4 [13:26:59] thanks hashar! [13:33:02] PROBLEM - RAID on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:57] akosiaris, do you know why none of the maps services produce any log files? [13:36:07] mobrovac, ^? [13:38:07] er, that's your service, if you don't know.. that's bad [13:38:17] means probably nobody does... [13:38:41] I see kartotherian is supposed to log into /var/log/kartotherian/main.log [13:38:42] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:12] PROBLEM - SSH on ms-be2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:33] !log continue rolling reboot of maps cluster for kernel update (2002-2004) [13:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:23] PROBLEM - swift-container-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:23] PROBLEM - swift-account-reaper on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:51] PROBLEM - swift-object-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:41:51] PROBLEM - swift-container-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:51] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:13] PROBLEM - dhclient process on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:21] PROBLEM - swift-account-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:22] PROBLEM - swift-container-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:22] PROBLEM - swift-object-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:22] PROBLEM - swift-object-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:32] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:41] PROBLEM - salt-minion processes on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:42] PROBLEM - swift-account-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:11] PROBLEM - swift-object-server on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:21] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [13:46:12] PROBLEM - very high load average likely xfs on ms-be2020 is CRITICAL: CRITICAL - load average: 146.17, 113.98, 64.10 [13:48:25] looking at ms-be2020 [13:50:11] 6operations, 10Traffic, 5Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1997739 (10ema) libvmod-tbf packaged and [[ https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/varnish/libvmod-tbf | pushed to gerrit ]]. Some additional work wa... 
[13:56:43] PROBLEM - swift codfw-prod object availability on graphite1001 is CRITICAL: CRITICAL: 20.41% of data under the critical threshold [90.0] [13:56:59] !log powercycle ms-be2020 [13:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:31] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1997767 (10Joe) I just wrote some unit tests and they confirm my suspicion: we do not manage DELETE in etcd within pybal. The tests can be seen in the (still failing) PS [13:59:21] RECOVERY - very high load average likely xfs on ms-be2020 is OK: OK - load average: 7.99, 2.12, 0.72 [13:59:22] RECOVERY - swift-account-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:59:22] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures [13:59:22] RECOVERY - salt-minion processes on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:59:32] (03PS1) 10Giuseppe Lavagetto: Make etcd updates more fault-tolerant [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) [13:59:41] RECOVERY - swift-account-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:59:52] RECOVERY - SSH on ms-be2020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [14:00:02] RECOVERY - swift-object-server on ms-be2020 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:00:13] RECOVERY - swift-container-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:00:13] RECOVERY - swift-account-reaper on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-account-reaper [14:00:42] RECOVERY - swift-container-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:00:42] RECOVERY - swift-object-auditor on ms-be2020 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:00:42] RECOVERY - swift-container-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:00:46] (03CR) 10jenkins-bot: [V: 04-1] Make etcd updates more fault-tolerant [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [14:01:12] RECOVERY - dhclient process on ms-be2020 is OK: PROCS OK: 0 processes with command name dhclient [14:01:12] RECOVERY - swift-container-updater on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [14:01:12] RECOVERY - swift-account-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:01:12] RECOVERY - swift-object-updater on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:01:13] RECOVERY - swift-object-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:01:13] RECOVERY - RAID on ms-be2020 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:04:50] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1997809 (10Gehel) @elukey: https://phabricator.wikimedia.org/L3 reviewed and "signed". Thanks for the help ! 
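The swift PROBLEM/RECOVERY storm above comes from NRPE process checks that count processes whose command line matches an anchored regex such as `^/usr/bin/python /usr/bin/swift-object-server`. A sketch of that counting logic with invented sample command lines (the real check is check_procs running on the host, not this Python):

```python
import re

def count_matching_procs(cmdlines, pattern):
    """Count command lines matching a regex, as a check_procs-style
    'processes with regex args' check does."""
    rx = re.compile(pattern)
    return sum(1 for cmdline in cmdlines if rx.search(cmdline))

def procs_status(count, minimum=1):
    """Render a PROCS OK/CRITICAL line from the count."""
    state = "OK" if count >= minimum else "CRITICAL"
    return "PROCS %s: %d process(es)" % (state, count)
```

During the ms-be2020 hang the NRPE socket timed out before any counting could happen, which is why every one of these distinct checks went CRITICAL at once and all recovered together after the powercycle.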
[14:09:33] !log Restarted blazegraph on wdqs1001 [14:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:15] (03PS1) 10Gehel: Adding monitoring of some ElasticSearch thread pools [puppet] - 10https://gerrit.wikimedia.org/r/268384 (https://phabricator.wikimedia.org/T125782) [14:12:37] (03PS1) 10Ema: Skip tests if varnishtest is not installed [software/varnish/libvmod-header] (debian) - 10https://gerrit.wikimedia.org/r/268385 (https://phabricator.wikimedia.org/T124281) [14:12:46] 6operations, 10DBA, 6Labs, 5Patch-For-Review: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1997835 (10jcrespo) These tables where deleted in db1069 and labsdb*: {F3310163} {F3310164} [14:13:51] SMalyshev: https://phabricator.wikimedia.org/T125818 [14:14:23] (03CR) 10DCausse: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/268384 (https://phabricator.wikimedia.org/T125782) (owner: 10Gehel) [14:18:39] 6operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 5 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1997851 (10BBlack) Can we get some more-spe... [14:19:53] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#1997856 (10jcrespo) [14:24:39] 6operations, 10ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1997873 (10fgiunchedi) ok, it doesn't seem to be able to boot back up via grub, I'm reimaging [14:24:42] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: puppet fail [14:25:11] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#1997874 (10jcrespo) Enough time has passed since they were depooled and the new servers, pooled with no incident reported. 
They can be stopped/deleted/etc with no loss. [14:27:28] greg-g: I will probably need to deploy a JavaScript fix to wikidata soon-ish. https://gerrit.wikimedia.org/r/#/c/268391/2/view/resources/jquery/wikibase/jquery.wikibase.entitytermsforlanguagelistview.js [14:28:17] 6operations, 7HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#1997890 (10hashar) 3NEW [14:28:54] 6operations, 5Continuous-Integration-Scaling, 7HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#1997897 (10hashar) [14:29:28] 6operations, 5Continuous-Integration-Scaling, 7HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#1997890 (10hashar) [14:32:29] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#1997918 (10jcrespo) a:3RobH @Robh do you have any input on this (need for datacenter space/set them as spare vs. unrack them?). I am assigning to you, but you can give it me back to me or anyone else/up fo... [14:35:52] 6operations, 10DBA, 6Labs, 5Patch-For-Review: Set up additional filters for Echo tables - https://phabricator.wikimedia.org/T125591#1997943 (10jcrespo) 5Open>3Resolved @Mattflaschen This is done, please continue reporting any potential issue regarding privacy && labs, even if data was not really expose... [14:41:56] !log rebooting db203[45] for kernel update [14:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:03] morebots, please note that s1 and s7 servers (such as db2034.codfw.wmnet) will log errors due to health checks [14:50:03] I am a logbot running on tools-exec-1204. [14:50:03] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:50:03] To log a message, type !log . 
[14:50:17] not you, moritzm [14:50:37] those are not a problem, because it is not real traffic [14:51:15] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail [14:51:19] ok [14:52:01] but I freaked out the first time I saw https://logstash.wikimedia.org/#dashboard/temp/AVKsxHVzptxhN1XaD0z5 [14:52:15] (some months ago) [14:52:45] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:57:35] 6operations, 10DBA: Investigate/decom db2001-db2008 - https://phabricator.wikimedia.org/T125827#1998016 (10jcrespo) 3NEW [14:58:23] !log stopping eventlogging to reboot eventlog1001 for kernel update [14:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:40] (03PS2) 10Ottomata: Change default consistency to localOne [puppet] - 10https://gerrit.wikimedia.org/r/267924 (https://phabricator.wikimedia.org/T124947) (owner: 10Eevans) [15:00:55] (03CR) 10Ottomata: [C: 032 V: 032] Change default consistency to localOne [puppet] - 10https://gerrit.wikimedia.org/r/267924 (https://phabricator.wikimedia.org/T124947) (owner: 10Eevans) [15:07:08] (03CR) 10JanZerebecki: [C: 04-1] "Err something doesn't work here with the .gitreview file." 
[puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [15:07:31] !log rebooting oxygen for kernel update [15:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:03] !log disabling puppet on restbase cluster in preparation for configuration deploy (https://gerrit.wikimedia.org/r/#/c/266297/) [15:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:41] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1998036 (10Paladox) [15:11:09] <_joe_> !log restarting pybal on lvs200{3,6} [15:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:19] 6operations, 10OTRS: Error while sending emails with OTRS - https://phabricator.wikimedia.org/T125756#1998066 (10akosiaris) Memory usage appears to have stabilized at 4G. Way more than the previous version but acceptable nonetheless. I will monitor it for another day though. [15:13:35] godog: can you merge https://gerrit.wikimedia.org/r/#/c/266297/ for me, por favor? [15:14:20] (03PS9) 10Filippo Giunchedi: [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [15:14:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [15:14:27] godog: thank you sir! [15:14:42] urandom: np! {{done}} [15:15:54] !log re-enabling puppet and forcing run on restbase1001.eqiad.wmnet (canary config deploy) [15:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:02] some 502 error [15:16:15] <_joe_> Vito: what url? 
[15:16:29] https://it.wikipedia.org/wiki/Speciale:Ripristina/Utente:Il_letto_%C3%A8_bello [15:17:03] <_joe_> Vito: I don't have permissions to see that page, but I'm not getting a 502 now [15:17:23] yep, that's why I said "some" ;) [15:17:27] it happens sometimes [15:17:38] <_joe_> I hope not so often [15:18:31] <_joe_> according to https://grafana.wikimedia.org/dashboard/db/varnish-http-errors only 1.3 requests every million are an error or a timeout [15:18:31] !log restarting restbase on restbase1001.eqiad.wmnet (config deploy) [15:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:49] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:20:31] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal etcd coroutine crashed - https://phabricator.wikimedia.org/T125397#1998073 (10Joe) Problem reproduced and it's properly fixed when https://gerrit.wikimedia.org/r/268361 is applied. [15:20:55] (03Abandoned) 10Giuseppe Lavagetto: [WiP] Allow treating pooled=inactive differently from pooled=no in the etcd driver [debs/pybal] - 10https://gerrit.wikimedia.org/r/266728 (owner: 10Giuseppe Lavagetto) [15:21:37] (03CR) 10Giuseppe Lavagetto: [C: 032] "Tested on pybal-test2001, works as expected." 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/268361 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [15:25:37] 6operations, 10ops-codfw: ms-be2015 doesn't come up after reboot - https://phabricator.wikimedia.org/T125383#1998079 (10fgiunchedi) 5Open>3Resolved reimage finished [15:34:07] !log reenabling puppet on restbase cluster (continue config deploy) [15:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 64.00% of data above the critical threshold [5000000.0] [15:35:39] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#1998109 (10Cmjohnson) pc1001-1003 are CISCO servers and will be unracked. I am not sure if we will be able to use the disks in anything else but they will be removed and the original cisco disk put back in. [15:35:47] !log forcing puppet run on restbase cluster (config deploy) [15:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:19] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1998130 (10Cmjohnson) I do have a disk on-site if still needed. megaraid controller shows all disks as good so we'll have to figure out the correct disk to replace. 
[15:44:26] !log rebooting db203[678] for kernel update [15:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:00] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:47:04] !log restbase cluster puppet run complete; performing rolling restart of restbase (applying https://gerrit.wikimedia.org/r/#/c/266297/) [15:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:51] !log rolling restbase restart complete [15:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:56:28] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [15:56:39] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1998160 (10elukey) Long tests finished, the only thing that I can see is that /dev/sdf (the one throwing I/O errors) has a bad sector in its defect list: ``` elukey@kafka1012:~$ f... [15:56:46] 6operations, 10ops-eqiad: RMA Juniper EX-UM-2X4SFP UPLINK - https://phabricator.wikimedia.org/T124436#1998162 (10Cmjohnson) 5Open>3Resolved Return shipping information UPS 1Z7AD3889025171943 [15:58:17] 6operations, 10ops-eqiad: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1998166 (10Cmjohnson) Disk replaced, Return shipping information USPS 9202 3946 5301 2430 6122 60 FEDEX 9611918 2393026 52103949 [15:59:35] PROBLEM - mysqld processes on db2030 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160204T1600). Please do the needful. 
[16:00:05] Luke081515: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:24] moritzm: db2030 is/was you? ^ [16:00:38] RECOVERY - swift codfw-prod object availability on graphite1001 is OK: OK: Less than 1.00% under the threshold [95.0] [16:00:48] Who will SWATß [16:00:50] *? [16:01:04] Luke081515: I can SWAT today. [16:01:08] thanks [16:01:13] https://gerrit.wikimedia.org/r/#/c/267804/ is it [16:01:21] only one patch today [16:02:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448) (owner: 10Luke081515) [16:02:34] godog: yes, it's unused and I marked it for downtime, but apparently the monitoring is still up for some reason [16:02:49] moritzm: ack, thanks! [16:02:59] 6operations, 5Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1998201 (10elukey) @Dzahn, I wrote this wiki page: https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/JessieMigration The idea would be to do 1004/5 tomorrow, and then try the wmf-reimage... [16:03:15] 6operations, 10Traffic, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1998203 (10Aklapper) This task has been "Unbreak now" priority since it was created and has seen no updates for nearly two months. [[ https://www.mediawiki.org/w... 
[16:03:38] (03PS2) 10Giuseppe Lavagetto: Make etcd updates more fault-tolerant [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) [16:03:40] godog: fixed [16:04:11] (03Merged) 10jenkins-bot: Enable confirmed group at nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267804 (https://phabricator.wikimedia.org/T125448) (owner: 10Luke081515) [16:04:58] (03CR) 10jenkins-bot: [V: 04-1] Make etcd updates more fault-tolerant [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [16:05:26] RECOVERY - mysqld processes on db2030 is OK: PROCS OK: 1 process with command name mysqld [16:06:37] hello. are we swatting? can i add a last-minute change? [16:06:47] thcipriani: ^ [16:07:06] MatmaRex: I am swatting, and sure, go for it. [16:07:23] thanks. give me a minute please [16:07:50] !log thcipriani@mira Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable confirmed group at nowiki [[gerrit:267804]] (duration: 02m 15s) [16:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:53] ^ Luke081515 check please [16:08:15] thcipriani: Works, thanks :) I will close the task now [16:08:21] Luke081515: thank you! [16:09:58] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [16:10:18] thcipriani: https://gerrit.wikimedia.org/r/#/c/268407/ for wmf.12. (added to wikitech:Deployments too.) 
[16:10:50] * thcipriani looks [16:14:05] (03PS3) 10JanZerebecki: Update WikidataBuildResources git source (github -> gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [16:15:40] (03CR) 10Aude: "suppose the .gitreview for build resources conflicts with having one for the build result" [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [16:17:08] (03CR) 10JanZerebecki: [C: 031] "Ok now works. (Test result: https://gerrit.wikimedia.org/r/#/c/268406/1 )" [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [16:17:50] 6operations, 10Traffic, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1998273 (10Joe) @Aklapper the bug is solved in the code but needs to be deployed to production, which will happen very soon. [16:21:22] (03PS1) 10Merlijn van Deen: Set nlwiki collation to uca-nl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268409 (https://phabricator.wikimedia.org/T125774) [16:23:22] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1998282 (10elukey) @Cmjohnson: if the missing temperature warnings are fine from your point of view I would change the /dev/sdf drive since it is the only one showing a sign of degr... [16:28:04] ugh, there's a really long queue for gate-and-submit. [16:28:33] MatmaRex: indeed. :\ [16:29:56] 6operations, 10Deployment-Systems, 6Release-Engineering-Team: /srv/mediawiki-staging broken on both scap masters - https://phabricator.wikimedia.org/T125506#1998305 (10greg) 5Open>3Resolved >>! In T125506#1994318, @greg wrote: > Keeping open until the report is posted, but for the purposes of pushing out... 
[16:30:28] 6operations, 10MediaWiki-API, 6Services, 10Traffic, 7Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1998311 (10GWicke) I have set up a basic latency and request rate dashboard using the Varnish metrics at https://grafana.wikimedia.org/dashb... [16:34:58] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.008 second response time [16:39:27] elukey, for my access request (https://gerrit.wikimedia.org/r/#/c/267919/) so far the patch contains only my credentials. Should I also add the groups ? Or is it bad mojo to submit your own access change ? [16:39:55] (03CR) 10Aude: "thanks jan!" [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [16:41:09] RECOVERY - configured eth on mw1228 is OK: OK - interfaces up [16:41:14] gehel: not bad since it will need a +2 to get merged, if you want to work on it I don't see any issue :) Bear in mind though that the approval will need to wait until next Monday [16:41:16] thcipriani: it went through, finally. [16:41:18] RECOVERY - salt-minion processes on mw1228 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:41:20] RECOVERY - nutcracker process on mw1228 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [16:41:27] MatmaRex: phew, ridiculous [16:41:29] RECOVERY - HHVM processes on mw1228 is OK: PROCS OK: 6 processes with command name hhvm [16:41:49] RECOVERY - Disk space on mw1228 is OK: DISK OK [16:41:49] RECOVERY - DPKG on mw1228 is OK: All packages OK [16:41:57] elukey, no problem for waiting. But if I can submit the code so that no one else has to, I'll do it ..
[16:41:59] RECOVERY - dhclient process on mw1228 is OK: PROCS OK: 0 processes with command name dhclient [16:42:09] RECOVERY - RAID on mw1228 is OK: OK: no RAID installed [16:42:14] (03CR) 10Aude: "i think when making a build myself, don't think i copy over the dot files like .gitreview so we still have the old, correct one in Wikidat" [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [16:42:38] RECOVERY - nutcracker port on mw1228 is OK: TCP OK - 0.000 second response time on port 11212 [16:42:39] RECOVERY - Check size of conntrack table on mw1228 is OK: OK: nf_conntrack is 0 % full [16:42:41] gehel: I don't see any problem with it, please go ahead :) [16:42:59] elukey, wilco [16:43:28] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1998342 (10Anomie) [16:45:23] !log thcipriani@mira Synchronized php-1.27.0-wmf.12/includes/media/Bitmap.php: SWAT: BitmapHandler: Implement validateParam() [[gerrit:268407]] (duration: 02m 08s) [16:45:25] ^ MatmaRex check please [16:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:46:18] RECOVERY - NTP on mw1228 is OK: NTP OK: Offset -0.01611125469 secs [16:46:33] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Maintenance-scripts, 7Upstream: PHP 5.5.9 seems to have issues parsing some command line parameters via multiversion's MWScript - https://phabricator.wikimedia.org/T125748#1998348 (10Krenair) [16:46:45] looking [16:48:31] seems fine thcipriani. :) [16:48:48] MatmaRex: awesome. 
Thanks for checking and hanging out watching jenkins :) [16:52:03] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Maintenance-scripts, 7Upstream: PHP 5.5.9 seems to have issues parsing some command line parameters via multiversion's MWScript - https://phabricator.wikimedia.org/T125748#1998365 (10Krenair) ```krenair@terbium:~$ cat /srv/mediawiki/php-1.27.0-wmf.12... [16:52:11] !log restarted apache2 on iridium so that phabricator recognizes sprint.phragile-uri [16:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:53:02] !log restarted phd to synchronize settings with phabricator [16:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:53:28] (03PS1) 10MarcoAurelio: Enabling Ext:ShortURL for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268414 (https://phabricator.wikimedia.org/T125802) [16:53:52] (03PS4) 10Gehel: Adding user gehel (Guillaume Lederrey) to user list and to necessary groups [puppet] - 10https://gerrit.wikimedia.org/r/267919 (https://phabricator.wikimedia.org/T125651) [16:55:09] (03PS1) 10Papaul: Add sarin to DNS entries. Removed asset tag wmf5834; wmf5835 appeared twice in the files. Put asset tags in alphabetical order. Bug:T125753 [dns] - 10https://gerrit.wikimedia.org/r/268415 (https://phabricator.wikimedia.org/T125753) [16:55:33] 6operations, 10RESTBase: Reduce log spam by removing non-operational cassandra IPs from seeds - https://phabricator.wikimedia.org/T123869#1998385 (10mobrovac) 5Open>3Resolved a:3Eevans [16:58:37] (03PS3) 10Giuseppe Lavagetto: Make etcd updates more fault-tolerant [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) [16:58:49] greg-g: I will probably need to deploy a JavaScript fix to wikidata soon-ish. https://gerrit.wikimedia.org/r/268411 [17:00:03] Task: https://phabricator.wikimedia.org/T125813 [17:00:04] moritzm mutante: Dear anthropoid, the time has come.
Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160204T1700). [17:00:05] gehel aude: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:45] * aude waves [17:00:57] * gehel waves too [17:03:03] (03PS2) 10Papaul: Add sarin to DNS entries. Removed asset tag wmf5834; wmf5835 appeared twice in the files. Put asset tags in alphabetical order. Bug:T125753 [dns] - 10https://gerrit.wikimedia.org/r/268415 (https://phabricator.wikimedia.org/T125753) [17:05:14] (03CR) 10Luke081515: [C: 031] Enable signature button for the Project namespace in ru.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267997 (https://phabricator.wikimedia.org/T125509) (owner: 10Dereckson) [17:07:23] 6operations, 10ops-codfw: onsite setup for sarin (WMF5851) - https://phabricator.wikimedia.org/T125753#1998453 (10Papaul) sarin 10.193.2.234 port ge-5/0/16 rack A5 [17:07:51] 6operations, 10ops-codfw: onsite setup for sarin (WMF5851) - https://phabricator.wikimedia.org/T125753#1998454 (10Papaul) [17:08:40] moritzm, mutante: first time I have a patch in deployment. Do you need anything from me ? [17:10:43] (03CR) 10Luke081515: [C: 04-1] "Currently blocked on community consensus (vote in progress)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268409 (https://phabricator.wikimedia.org/T125774) (owner: 10Merlijn van Deen) [17:10:44] 6operations, 10DBA, 10MediaWiki-Special-pages, 10Wikidata, 7Performance: Batch updates create slave lag on s3 over WAN - https://phabricator.wikimedia.org/T122429#1998478 (10hoo) [17:11:15] I guess Greg is not here today?
[17:11:34] hoo: I am, sorry, was just in a meeting [17:11:37] hoo: doit [17:11:44] :) [17:11:53] (and then had to figure out some budget things quickly) [17:12:01] Nice, thanks :) [17:13:44] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Maintenance-scripts, 7Upstream: PHP 5.5.9 seems to have issues parsing some command line parameters via multiversion's MWScript - https://phabricator.wikimedia.org/T125748#1998503 (10Krenair) >>! In T125748#1996646, @Smalyshev wrote: > Docs for getop... [17:15:42] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Maintenance-scripts, 7Upstream: PHP 5.5.9's getopt does not respect global $argv like HHVM does, causing issues parsing command line parameters via multiversion's MWScript. - https://phabricator.wikimedia.org/T125748#1998511 (10Krenair) [17:16:33] 6operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 5 others: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#1998513 (10Danny_B) 3NEW [17:16:41] 6operations, 10OTRS, 6Security, 7HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1998519 (10DaBPunkt) >>! In T91504#1996450, @Dzahn wrote: > https://www.ssllabs.com/ssltest/analyze.html?d=ticket.wikimedia.org grade A > > @DaBPunkt i would like to claim this is resolv... [17:17:46] 6operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 5 others: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#1998513 (10Danny_B) [17:18:44] (03PS1) 10Dereckson: Clean duplicate wgCopyUploadsDomains setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268419 [17:18:46] 6operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 5 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#1954680 (10Danny_B) I split the cache pu...
[17:22:36] (03PS1) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [17:22:38] 6operations, 10Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1998550 (10Papaul) [17:23:09] 6operations, 10Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996363 (10Papaul) [17:23:28] (03CR) 10Ema: [C: 032 V: 032] Skip tests if varnishtest is not installed [software/varnish/libvmod-header] (debian) - 10https://gerrit.wikimedia.org/r/268385 (https://phabricator.wikimedia.org/T124281) (owner: 10Ema) [17:23:36] Luke081515: could you take care to include mdann52 patches in a SWAT window? [17:24:28] (03CR) 10jenkins-bot: [V: 04-1] add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 (owner: 10BBlack) [17:25:08] (03PS2) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [17:25:15] 6operations, 10Traffic, 5Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1998562 (10ema) libvmod-vslp packaged and [[https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/varnish/libvmod-vslp|pushed to gerrit]]. [17:25:33] 6operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#1998563 (10RobH) 3NEW a:3RobH [17:28:09] mutante:puppet swat today? [17:29:17] moritzm: puppet swat today? 
[17:29:24] !log (Re)started wdqs-updater on wdqs1001, but seems it doesn't work [17:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:30] 6operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#1998577 (10RobH) [17:29:33] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1998576 (10RobH) [17:30:47] !log hoo@mira Synchronized php-1.27.0-wmf.12/extensions/Wikidata: Fix editing terms in languages other than the interface language via the term box (duration: 02m 18s) [17:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:32:01] confirmed fixed [17:34:19] (03CR) 10Luke081515: [C: 031] Adding museumvictoria.com.au domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267677 (https://phabricator.wikimedia.org/T125387) (owner: 10MarcoAurelio) [17:34:45] (03CR) 10Luke081515: [C: 031] Enabling Ext:ShortURL for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268414 (https://phabricator.wikimedia.org/T125802) (owner: 10MarcoAurelio) [17:35:18] (03CR) 10Luke081515: [C: 031] Removing testwiki from wmgUseShortUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267784 (owner: 10MarcoAurelio) [17:35:33] (03CR) 10Luke081515: [C: 031] Enabling Extension:ShortUrl for bhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267783 (https://phabricator.wikimedia.org/T113348) (owner: 10MarcoAurelio) [17:35:53] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1998638 (10RobH) [17:35:57] gehel aude I'll be swatting puppet, gehel you are up first! [17:36:44] * aude here [17:36:47] godog, great ! 
(03PS2) 10Filippo Giunchedi: Adding monitoring of some ElasticSearch thread pools [puppet] - 10https://gerrit.wikimedia.org/r/268384 (https://phabricator.wikimedia.org/T125782) (owner: 10Gehel) [17:37:57] (03PS1) 10Papaul: Add prod DNS entries for sarin Bug:T125752 [dns] - 10https://gerrit.wikimedia.org/r/268422 (https://phabricator.wikimedia.org/T125752) [17:38:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Adding monitoring of some ElasticSearch thread pools [puppet] - 10https://gerrit.wikimedia.org/r/268384 (https://phabricator.wikimedia.org/T125782) (owner: 10Gehel) [17:38:58] ebernhardson: that was copied incorrectly on the Deployments page, mutante and myself are doing puppet swat next week, this week it's _joe_ and ema [17:39:21] 6operations, 10Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1998654 (10Papaul) [17:40:08] moritzm: ahh ok, happens. [17:41:43] gehel: I'm verifying the diamond collector [17:42:10] godog, thanks ! I don't have access yet to validate much ... [17:42:29] i'll check graphite to see if they start coming in, will probably take a minute or two [17:43:19] yup, LGTM [17:44:16] <_joe_> moritzm: i didn't get pinged :/ [17:44:20] <_joe_> sorry [17:44:26] <_joe_> I'm here now [17:44:54] <_joe_> godog: need me to take over?
[17:45:30] _joe_: nah it is fine, I'm looking at the other patch while ebernhardson confirms the metrics are flowing [17:45:37] namely https://gerrit.wikimedia.org/r/#/c/267242/3 [17:46:03] no metrics yet, but not sure how long I should wait on graphite [17:46:33] _joe_: there seems to have been a copy&paste bug in the Deployments wiki page, mutante and myself (who are on puppet swat next week) were listed there, so we got pinged instead [17:46:43] and I only saw that 5 mins ago [17:46:44] <_joe_> moritzm: yeah got it [17:46:48] <_joe_> lol, ok [17:46:57] <_joe_> it was me and ema this week [17:47:42] <_joe_> aude: I'm unsure how a change of origin there will work [17:47:47] <_joe_> how/if [17:47:48] ebernhardson: a couple of minutes should be enough, I saw this being sent by elastic1001 elasticsearch.production-search-eqiad.elasticsearch.thread_pool.search.active:8|g [17:48:16] godog, that looks like my patch [17:48:33] I don't think it'll do anything _joe_ I was looking at git::clone [17:48:59] godog: hmm, that is only for the master, these should be per-node metrics in server.elastic*.elasticsearch.thread_pool.* [17:49:08] patch added them to per node as well, hmm [17:49:18] <_joe_> godog: yeah that was how i remember git::clone [17:49:52] <_joe_> godog: so merge it, if the cron change looks good, I'll change the url manually via salt [17:49:59] _joe_: not totally sure though jzerebecki already tried it [17:50:37] for vagrant, i had to add a command to do git remote set-url ... [17:50:38] https://gerrit.wikimedia.org/r/#/q/status:open,n,z [17:51:23] https://gerrit.wikimedia.org/r/#/c/267864/8/puppet/modules/role/manifests/wikidata.pp [17:51:39] _joe_: ok, maybe easier if you merge it too, I'll keep looking at the other patch [17:52:11] godog: how did you see which metrics diamond is shipping out?
* ebernhardson can only guess strace -e trace=network, but seems overkill :) [17:52:55] <_joe_> ebernhardson: tcpdump is way better [17:53:00] <_joe_> anyways [17:53:15] <_joe_> aude: I'm gonna merge your patch and verify what happens [17:53:23] _joe_: ok [17:53:37] ebernhardson: hehe sth like that, tcpdump... [17:54:04] (03PS4) 10Giuseppe Lavagetto: Update WikidataBuildResources git source (github -> gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [17:54:31] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1998727 (10bd808) >>! In T124440#1987589, @Anomie wrote: > See also https://gerrit.wikimedia.org/r/#/c/267734/ and https://gerrit.wikimedia.org/r/#/c/267735/. These... [17:55:20] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1998734 (10RobH) restbase1001-1006 are now slated to be replaced with new hosts, the ordering of the new hosts and replacement of the existing ones is trac... [17:55:44] <_joe_> aude: ? [17:56:21] <_joe_> hoo: maybe you know? Is there some labs project that uses it? [17:56:40] _joe_: The build resources? [17:56:52] In labs, yes [17:56:53] <_joe_> hoo: the puppet module wikidatabuilder [17:57:00] yes, labs [17:57:01] <_joe_> ok, which machine?
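[Editor's aside for readers following the diamond/graphite exchange above: the line godog quoted is a plaintext statsd datagram in the `metric:value|type` wire format, where `|g` marks a gauge. A minimal shell sketch of taking such a line apart; the metric name is copied from the log, the parsing itself is just illustrative:]

```shell
# A statsd gauge datagram as quoted above: metric name, value, and type
# separated by ':' and '|'. Plain parameter expansion is enough to split it.
line='elasticsearch.production-search-eqiad.elasticsearch.thread_pool.search.active:8|g'
metric=${line%%:*}   # everything before the first ':'
rest=${line#*:}      # "value|type"
value=${rest%%|*}    # numeric value
mtype=${rest##*|}    # 'g' = gauge, 'c' = counter, 'ms' = timer
echo "$metric $value $mtype"
```

[This is also roughly what one sees when sniffing the UDP traffic with tcpdump, as _joe_ suggests: one such line per datagram.]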
uh, let me think [17:57:23] wikidata-builder1.eqiad.wmflabs [17:57:27] not sure that's the current one [17:57:32] but I think that's it [17:57:42] seeing new metrics for some elasticsearch nodes in graphite, but not all [17:57:49] (03CR) 10Giuseppe Lavagetto: [C: 032] "This is not used in production, I merge this noting that the git remote url will need to be changed manually on machines already using thi" [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [17:58:21] <_joe_> I'll just modify the url manually on that machine [17:58:27] _joe_: yes on labs [17:58:46] i'll say it's probably working, and perhaps a few of the nodes still need a puppet run [17:58:57] _joe_: ok, thanks [18:00:04] yurik gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160204T1800). [18:00:39] <_joe_> hoo: Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'pending https://phabricator.wikimedia.org/T111173'); [18:00:43] <_joe_> aude: ^^ [18:00:49] <_joe_> on that labs host [18:01:04] <_joe_> so I guess someone else will take care of it now that the patch is merged [18:01:06] jzerebecki probably did that [18:01:10] <_joe_> ok [18:01:24] sure he can take care of it after the patch is merged [18:01:28] <_joe_> we're done then [18:01:32] <_joe_> patch is merged [18:01:33] (03PS1) 10Papaul: Add sarin to DHCP Bug:T125752 [puppet] - 10https://gerrit.wikimedia.org/r/268425 (https://phabricator.wikimedia.org/T125752) [18:01:34] thanks :) [18:01:36] <_joe_> nothing to test in fact [18:01:47] ebernhardson gehel yup, e.g. elastic1001 I've forced a puppet run there after merge [18:01:53] if you or anyone also wants to review my vagrant patch to check it is sane, it would be appreciated [18:02:03] (03CR) 10Ori.livneh: [C: 031] "Looks good.
Great job with the tests. One tiny comment inside." (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [18:02:03] don't have to merge it [18:02:08] <_joe_> aude: I'm actually calling it a day [18:02:20] _joe_: ok [18:02:38] godog, so the others are going to come up with standard puppet run ? On what period do those run ? [18:02:50] <_joe_> gehel: every 20 minutes [18:02:55] <_joe_> + splay [18:03:11] <_joe_> so within 25 minutes or so it's expected to have run everywhere [18:03:19] gwicke cscott arlolra subbu bearND mdholloway: i plan to deploy tilerator service shortly [18:04:21] we don't have anything to deploy for mobileapps. Not sure why we were included in this deploy window. Maybe just a copy&paste from an older one [18:04:29] _joe_: TIL it is every 30m, I was looking at /etc/cron.d/puppet [18:04:37] 6operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1998793 (10Milimetric) [18:05:15] 6operations, 10Analytics, 10hardware-requests, 5Patch-For-Review: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1998797 (10Milimetric) [18:06:00] (03PS4) 10Giuseppe Lavagetto: Make etcd updates more fault-tolerant [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) [18:06:12] <_joe_> godog: oh we reduced it? [18:06:20] <_joe_> makes sense... [18:06:45] yeah I seemed to remember every 20m too [18:07:57] (03CR) 10Giuseppe Lavagetto: Make etcd updates more fault-tolerant (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [18:08:11] 6operations, 10Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1998817 (10Papaul) @Robh. Do you know what partman schema we are using for this system? 
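The run-interval discussion above (cron-driven agent runs every 20 to 30 minutes, plus a per-host splay so hosts don't all hit the master at once) can be illustrated roughly like this; real Puppet derives the offset from the FQDN via fqdn_rand, and the cksum hash here is only a stand-in for that:

```shell
# Rough illustration of cron splay: each host gets a deterministic offset
# within the run period, spreading agent runs across the window. Puppet
# computes this from the FQDN (fqdn_rand); cksum below is just a stand-in.
period=30   # minutes, per the /etc/cron.d/puppet remark above
for host in mw1115 mw1117 elastic1001; do
  splay=$(( $(printf '%s' "$host" | cksum | cut -d' ' -f1) % period ))
  echo "$host: runs at minute $splay of every $period"
done
```

Deterministic per-host offsets are why "within 25 minutes or so it's expected to have run everywhere" is a safe assumption after a merge: every host lands somewhere inside one period.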
[18:09:34] 6operations, 10Analytics, 10Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1998824 (10Milimetric) 5Open>3stalled No actionables. If we install we'll use RAID [18:09:59] 6operations, 10Analytics, 10Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1998826 (10Milimetric) 5stalled>3Resolved a:3Milimetric [18:11:46] _joe_: PS4 LGTM, should I merge or just +1 and leave it to you? [18:12:12] <_joe_> ori: merge it :) [18:17:36] actually [18:17:41] i have one other suggestion sec [18:18:42] bearND, this window is for all services, feel free to remove [18:19:06] oh, never mind. I was going to see that this would leave us with no useful log data to differentiate json decoding errors from update failures but that's not true, because failure.Failure() is magical and captures the current exception context [18:19:12] *to say [18:19:22] yurik: it's fine [18:19:31] (03CR) 10Ori.livneh: [C: 032] Make etcd updates more fault-tolerant [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [18:19:50] https://wikitech.wikimedia.org/wiki/Incident_documentation/20160204-Phabricator [18:20:05] jouncebot: next [18:20:05] In 1 hour(s) and 39 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160204T2000) [18:22:21] 6operations, 10Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1998966 (10RobH) @papaul: the raid1-lvm-ext4-srv.cfg option. Please note that the install for jessie cannot actually progress until T125256 is resolved (due to the installer bug.) 
[18:22:31] 6operations, 10Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1998972 (10RobH) [18:24:36] (03PS1) 10Chad: Move mwblocker.log to private data directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268431 [18:24:39] (03PS1) 10Chad: Stop ignoring mwblocker.log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268432 [18:27:47] (03PS1) 10ArielGlenn: dumps: make dumps cron script maintainable, dryrun option [puppet] - 10https://gerrit.wikimedia.org/r/268433 [18:28:10] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1999037 (10RobH) [18:31:30] (03PS2) 10ArielGlenn: dumps: make dumps cron script maintainable, dryrun option [puppet] - 10https://gerrit.wikimedia.org/r/268433 [18:31:39] (03Merged) 10jenkins-bot: Make etcd updates more fault-tolerant [debs/pybal] - 10https://gerrit.wikimedia.org/r/268382 (https://phabricator.wikimedia.org/T125397) (owner: 10Giuseppe Lavagetto) [18:31:43] (03PS3) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [18:31:45] (03PS1) 10BBlack: fix output typo in "depool" [puppet] - 10https://gerrit.wikimedia.org/r/268434 [18:32:19] (03CR) 10BBlack: [C: 032 V: 032] fix output typo in "depool" [puppet] - 10https://gerrit.wikimedia.org/r/268434 (owner: 10BBlack) [18:33:41] <_joe_> bblack: meh [18:34:14] twentyafterfour: > It's still unclear why it took this long for puppet to change the state of the repositories. [18:34:34] I could have misread the log, but according to /var/log/puppet.log(.*) , Puppet did not change the state of the repositories a second time [18:35:04] moritzm, how did you disable tilerator services? 
i cannot reenable them with sudo service tilerator enable [18:35:26] 6operations, 10ops-eqiad: mw1228 reporting readonly file system - https://phabricator.wikimedia.org/T122005#1999067 (10Cmjohnson) a:5Cmjohnson>3Joe new OS installed...assigning to @joe to add back to the cluster [18:35:34] <_joe_> yurik: because that's a wrong command :) [18:35:51] yurik: systemctl stop tilerator.service [18:35:51] _joe_, thx, i just realized that i should be using sudo systemctl unmask tilerator.service [18:35:56] i'm silly, sorry [18:35:56] ori: I'm reviewing the log myself. It must have been puppet [18:36:21] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1999072 (10Cmjohnson) Re-assigning back to papaul. I did not have the disks here [18:36:33] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1999073 (10Cmjohnson) a:5Cmjohnson>3Papaul [18:36:36] Can I get a root to `chown mwdeploy:wikidev /srv/mediawiki-staging/private/.{git,gitignore}` on mira? [18:36:51] <_joe_> cmjohnson1: thanks, I'll probably do that with elukey tomorrow [18:37:02] <_joe_> ostriches: yep, what happened? [18:37:22] <_joe_> oh right [18:37:24] root:root took ownership at some point [18:37:25] :) [18:37:29] <_joe_> I left that owned by root:root [18:37:36] <_joe_> no I did that on purpose I guess :P [18:37:41] Ah ok [18:37:49] <_joe_> I wasn't sure who had access before [18:37:55] <_joe_> so I erred on the side of caution [18:37:57] I'm trying to move some private stuff into private/ and out of .gitignore :) [18:37:59] grr why doesn't puppet log have timestamps [18:38:35] <_joe_> ostriches: 1 sec [18:38:47] <_joe_> twentyafterfour: they are timestamped in syslog [18:39:16] <_joe_> ostriches: done [18:39:19] ori, twentyafterfour: the problems starting at 6:31 (minus a few minutes until the monitoring kicks in), is suspiciously close to the cron.daily run at 6:25 [18:40:05] _joe_: ty!
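The tilerator mix-up above (`sudo service tilerator enable` failing, `systemctl unmask` being the right call) comes down to what masking actually is: a symlink from the unit name to /dev/null. A small simulation in a scratch directory, without touching systemd itself:

```shell
# "systemctl mask foo.service" symlinks the unit file to /dev/null, so the
# unit cannot be started even by hand; "unmask" removes that link. Simulated
# here in a scratch directory rather than /etc/systemd/system.
unitdir=$(mktemp -d)
ln -s /dev/null "$unitdir/tilerator.service"     # what mask does
target=$(readlink "$unitdir/tilerator.service")  # points at /dev/null
echo "masked: tilerator.service -> $target"
rm "$unitdir/tilerator.service"                  # what unmask does
rm -r "$unitdir"
```

"enable" is also not a SysV init verb, which is why `service tilerator enable` fails regardless; on the real host the sequence would be `systemctl unmask tilerator.service` followed by `systemctl start tilerator.service`.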
[18:40:07] <_joe_> ostriches: I'm afk now [18:40:13] <_joe_> dinner awaits [18:40:20] !log deployed and reenabled tilerator & tileratorui [18:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:27] moritzm: see my ops mail on it [18:40:29] (03CR) 10Chad: [C: 032] Move mwblocker.log to private data directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268431 (owner: 10Chad) [18:40:30] my second one [18:40:53] ori, same to you [18:41:34] (03Merged) 10jenkins-bot: Move mwblocker.log to private data directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268431 (owner: 10Chad) [18:41:43] (03PS4) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [18:41:46] twentyafterfour: same to you [18:42:06] hmm [18:42:15] (03PS1) 10Jcrespo: Add access to m5-master:testreduce* dbs for ssastry on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) [18:42:39] well in the puppet.log I see two instances of puppet screwing things up, just need to cross-reference timestamps [18:42:44] ori ^ [18:42:54] puppet.log.1.gz actually [18:43:08] the second one just sat there waiting to get us until [18:43:15] apache restarted with log rotation [18:43:19] and then, bam [18:43:56] 6:28 was Nikerabbit's first report [18:43:59] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations, 3Discovery-Search-Sprint, 5Patch-For-Review: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#1999114 (10Tfinc) Approved [18:45:32] !log demon@mira Synchronized private/mwblocker.log: (no message) (duration: 02m 10s) [18:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:46:24] if you look in /var/log/syslog (.1?) 
you can get the timestamps [18:46:46] (03CR) 10Chad: [C: 032] Stop ignoring mwblocker.log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268432 (owner: 10Chad) [18:47:24] apergos: that would explain it alright. I didn't think of logrotation [18:47:39] who did [18:47:45] wheee, gitignore-- [18:47:59] ostriches: :) [18:48:26] .gitignore: .gitignore [18:49:07] little known fact, that makes git die in a forever ending cycle [18:49:13] (03PS3) 10ArielGlenn: dumps: make dumps cron script maintainable, dryrun option [puppet] - 10https://gerrit.wikimedia.org/r/268433 [18:49:42] !log demon@mira Synchronized wmf-config/: removing old mwblocker.log (duration: 02m 07s) [18:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:51:03] (03CR) 10ArielGlenn: [C: 032] dumps: make dumps cron script maintainable, dryrun option [puppet] - 10https://gerrit.wikimedia.org/r/268433 (owner: 10ArielGlenn) [18:51:43] ostriches: I've got a beta only config change that jdlrobson would like to get deployed. Can that fit in front of the train or should we wait until after? [18:51:53] patch is https://gerrit.wikimedia.org/r/#/c/268202/ [18:52:16] apergos: I am updating the timeline to account for the apache log rotation restart. 
[18:52:31] but I still don't understand this error message from puppet: Error: /Stage[main]/Phabricator/Git::Install[phabricator/phabricator]/Exec[git_update_phabricator/phabricator]: Could not evaluate: invalid byte sequence in UTF-8 [18:52:31] ok great [18:52:42] I don't either and I don't care [18:53:01] I mean, it tried to update some things, it did poorly, that's what concerns me most [18:53:03] :-D [18:53:27] (not saying it might not have meaning for you as phab maintainer0 [18:53:29] ) [18:53:37] I'd like to know where the invalid byte sequence comes from [18:53:37] bd808: looking [18:53:55] apergos: not specifically asking you to explain it though [18:54:04] good thing too :-D [18:54:08] just kind of thinking out loud [18:54:11] sure sure [18:54:26] bd808: doing [18:54:32] (03CR) 10Chad: [C: 032] Config change 2: Suppress HTML from initial stable views on BC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268202 (https://phabricator.wikimedia.org/T124959) (owner: 10Jdlrobson) [18:54:37] where's that rubber ducky when I need it [18:54:38] ostriches: my hero. thanks [18:54:58] 6operations, 6Discovery, 7Elasticsearch, 7Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1999191 (10Deskana) In the absence of anything else more pressing, this would be a good one for @gehel to take a look at. @ebernha... 
[18:55:16] jdlrobson: buy ostriches a coffee the next time he accidentally shows up at the office :) [18:55:40] (03Merged) 10jenkins-bot: Config change 2: Suppress HTML from initial stable views on BC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268202 (https://phabricator.wikimedia.org/T124959) (owner: 10Jdlrobson) [18:55:51] (03CR) 10Mobrovac: [C: 04-1] ruthenium: Some more tweaks to parsoid + visualdiffing services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268326 (owner: 10Subramanya Sastry) [18:55:56] https://www.youtube.com/watch?v=bf9d7rSf_Ks twentyafterfour [18:56:20] (I think the last time I saw that was when it was first on tv :-P) [18:57:22] (03CR) 10Mobrovac: [C: 031] parsoid-rt-client: Have testreduce clients use global parsoid service [puppet] - 10https://gerrit.wikimedia.org/r/268230 (owner: 10Subramanya Sastry) [18:57:42] :) [18:59:22] will do ostriches ;-) [18:59:49] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:58] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:18] !log demon@mira Synchronized wmf-config/InitialiseSettings-labs.php: prod no op for completeness (duration: 03m 02s) [19:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:31] (03CR) 10Mobrovac: [C: 031] "The kernel upgrade is done. Let's coordinate on the deployment of this patch so that I can verify the domain creation process in RB." [puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [19:03:32] 6operations, 10Dumps-Generation: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1999230 (10ArielGlenn) Third time's a charm, after a rewrite, adding a dryrun option and a ton more testing on all the hosts of the dumps cron script we should be set to pick up tomorrow w... 
[19:03:39] (03PS1) 10Chad: Remove cdb file ignores for wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268441 [19:03:46] (03PS5) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [19:04:58] _joe_: subscribed to the task ;) [19:05:09] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 69.23% of data above the critical threshold [5000000.0] [19:05:28] (03CR) 10jenkins-bot: [V: 04-1] add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 (owner: 10BBlack) [19:07:15] (03PS1) 10Chad: Remove a bunch of ignored things that don't exist and shouldn't [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268442 [19:07:50] (03PS6) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [19:08:18] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 93 failures [19:08:36] (03CR) 10Chad: [C: 032] Remove cdb file ignores for wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268441 (owner: 10Chad) [19:08:47] (03CR) 10Chad: [C: 032] Remove a bunch of ignored things that don't exist and shouldn't [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268442 (owner: 10Chad) [19:09:26] (03Merged) 10jenkins-bot: Remove cdb file ignores for wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268441 (owner: 10Chad) [19:09:36] (03Merged) 10jenkins-bot: Remove a bunch of ignored things that don't exist and shouldn't [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268442 (owner: 10Chad) [19:11:27] (03PS3) 10Jdlrobson: On Beta Cluster: Use different logo for login form [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T115078) [19:13:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:13:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is 
CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:13:21] (03CR) 10BryanDavis: [C: 031] On Beta Cluster: Use different logo for login form [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T115078) (owner: 10Jdlrobson) [19:14:04] bd808: sorry that took me so long to get to. Fell off my radar [19:14:56] jdlrobson: heh. when it popped up in my email I didn't even remember commenting on it before [19:15:18] You should add it to the evening SWAT [19:15:28] (03PS3) 10Dzahn: Add sarin to DNS entires [dns] - 10https://gerrit.wikimedia.org/r/268415 (https://phabricator.wikimedia.org/T125753) (owner: 10Papaul) [19:16:10] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:16:23] (03PS1) 10Chad: Remove obsolete image ignore rules from .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268444 [19:16:39] (03CR) 10Chad: [C: 032 V: 032] Remove obsolete image ignore rules from .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268444 (owner: 10Chad) [19:16:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [19:17:02] (03PS2) 10Jcrespo: Add access to m5-master:testreduce* dbs for ssastry on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) [19:17:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:17:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:18:06] (03CR) 10Dzahn: [C: 032] Add sarin to DNS entires [dns] - 10https://gerrit.wikimedia.org/r/268415 (https://phabricator.wikimedia.org/T125753) (owner: 10Papaul) [19:18:36] (03PS1) 10Chad: Document why /logs is in gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268445 [19:18:43] (03CR) 10Subramanya
Sastry: ruthenium: Some more tweaks to parsoid + visualdiffing services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268326 (owner: 10Subramanya Sastry) [19:18:57] (03CR) 10Chad: [C: 032 V: 032] Document why /logs is in gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268445 (owner: 10Chad) [19:20:43] Dear Ops... To get familiar with our Puppet code base, I did some analysis with SonarQube. It might make sense to write a short mail about it. Where should I send it? [19:21:19] gehel: ops@lists.wikimedia.org, probably [19:21:28] (03PS2) 10Dzahn: Add prod DNS entries for sarin Bug:T125752 [dns] - 10https://gerrit.wikimedia.org/r/268422 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [19:22:23] (03PS1) 10Jforrester: Follow-up ce24fd2: Don't set things to 'true' when you mean 'false' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268446 (https://phabricator.wikimedia.org/T125850) [19:22:27] gehel: yea, that list. i think you have been subscribed the other day, right [19:22:29] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [19:23:17] (03CR) 10Dzahn: [C: 032] Add prod DNS entries for sarin Bug:T125752 [dns] - 10https://gerrit.wikimedia.org/r/268422 (https://phabricator.wikimedia.org/T125752) (owner: 10Papaul) [19:23:31] mutante, greg-g yes, I should be subscribed. I'll send my mail tomorrow. Thanks! [19:24:43] elukey: fyi, cmjohnson1 is about to swap sdf out [19:25:03] i started making a backup of the contents of /var/spool/kafka/f on /var/spool/kafka/e [19:25:04] gogogogo [19:25:06] but, it was very slow [19:25:12] 6operations, 10Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1999344 (10Dzahn) merged @Papaul's DNS changes. sarin exists now. 
``` [radon:~] $ host sarin.codfw.wmnet sarin.codfw.wmnet has address 10.192.0.140 [radon:~] $ host 10.192.0.140 140.0.192.10... [19:25:12] about 100G in the last hour [19:25:16] 1.1T total to do [19:25:25] figured by the time we finished all that [19:25:30] the data on f would be stale anyway [19:25:35] since it expires after 7 days [19:26:35] so i stopped it, and told cmjohnson1 to just swap the disk [19:26:43] +1 [19:26:47] ok [19:27:19] * elukey goes offline but will re-check the channel later on [19:27:22] ok cool [19:27:38] done [19:27:45] (03CR) 10Chad: [C: 032] Follow-up ce24fd2: Don't set things to 'true' when you mean 'false' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268446 (https://phabricator.wikimedia.org/T125850) (owner: 10Jforrester) [19:28:24] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1999359 (10Cmjohnson) swapped the disk [19:29:08] apergos, twentyafterfour: i mentioned yesterday -- 00:04 ori: cron.{daily,weekly,monthly} run at 6:25, 6:47 and 6:52 [19:29:17] but i don't think that was it, either [19:29:24] (03Merged) 10jenkins-bot: Follow-up ce24fd2: Don't set things to 'true' when you mean 'false' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268446 (https://phabricator.wikimedia.org/T125850) (owner: 10Jforrester) [19:29:25] ori: did you read my mail? [19:29:32] logrotate doesn't restart apache, it sends a SIGUSR1 to reload the config and reopen log files [19:29:35] cmjohnson1: cool don't see /dev/sdf anymore, should I get a new device? [19:29:36] oh, not yet. let me take a look [19:30:05] ottomata: you need to manually add it back [19:30:17] apergos: which e-mail is that?
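The abandoned copy discussed above (about 100G moved in the first hour, 1.1T total) implies roughly an eleven-hour job; a quick back-of-envelope check:

```shell
# Back-of-envelope for the abandoned /var/spool/kafka/f backup:
# ~100 GB/hour observed transfer rate, ~1.1 TB left to move.
rate_gb_per_hr=100
total_gb=1100
hours=$(( total_gb / rate_gb_per_hr ))
echo "estimated time remaining: about $hours hours"
```

Given that the broker data expires after 7 days anyway and the disk swap was waiting on the copy, abandoning it was the pragmatic call.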
[19:30:18] apergos: where would I find that email [19:30:23] I sent it to ops [19:30:34] it was a follow on to ori's email about the outage [19:30:44] i just saw >I've added a 'troubleshooting' section to the wikitech Phab page at the top, describing the 'quick fix' until scap3 is in place via puppet. [19:30:45] exactly after the log rotation we see errors [19:30:50] no, the next one [19:30:55] there were two emails, sorry [19:30:59] * ori doesn't see it [19:31:02] er [19:32:01] might have not reply all somehow [19:32:06] sigh [19:32:48] !log demon@mira Synchronized wmf-config/InitialiseSettings.php: T125850 (duration: 02m 11s) [19:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:59] James_F: ^^ [19:33:34] could SIGUSR1 cause apc to reload its cache? [19:33:42] I am guessing it would have [19:34:30] ostriches: Thanks. [19:35:37] yw [19:36:17] http://stackoverflow.com/questions/2785533/does-a-graceful-apache-restart-clear-apc [19:36:23] sayeth stackoverflow [19:36:43] yep that would explain it then [19:36:51] ori ^ [19:37:06] good detective work, apergos [19:37:12] thanks! [19:37:25] my forwarded email should have arrived at your inboxes now [19:37:30] eyeroll [19:37:59] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:42:58] mobrovac: wanna do the "ady.wp" creation in restbase? [19:43:15] !log rebooting kafka1012 [19:43:28] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:43:38] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
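The mechanism the channel converges on above (logrotate sends Apache SIGUSR1 for a graceful restart, and a graceful restart discards the APC cache) can be modeled in miniature with a shell trap; this is a toy stand-in for the signal handling, not Apache's actual behavior:

```shell
# Miniature model of the failure mode: a long-running process that discards
# and rebuilds state when it receives SIGUSR1, the signal logrotate sends
# Apache for a graceful restart (which, per the StackOverflow link above,
# also clears the APC opcode cache).
cache="warm (pre-rotation opcode cache)"
trap 'cache="cold (rebuilt after graceful restart)"' USR1
before=$cache
kill -USR1 $$    # roughly what logrotate's postrotate step does to apache
after=$cache     # the trap has fired by the time this line runs
echo "before: $before"
echo "after:  $after"
```

The timing in the incident fits this model: the bad state planted by Puppet sat dormant until the nightly log rotation forced the graceful restart and the cache was rebuilt from it.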
[19:44:06] sure mutante [19:44:08] I wish I knew what keys.txt was [19:44:34] mobrovac: ok, cool, i will merge https://gerrit.wikimedia.org/r/#/c/268016/ [19:44:48] k [19:45:18] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [19:45:19] (03PS2) 10Dzahn: RESTBase and Labs DNS configuration for ady.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [19:45:28] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [19:45:31] (03CR) 10Dzahn: [C: 032] RESTBase and Labs DNS configuration for ady.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/268016 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [19:46:00] twentyafterfour: do you wanna update the incident report with the cause of the outage then? [19:46:10] apergos: I am updating it now [19:46:16] sweet! [19:46:23] there's just one detail that still doesn't make sense - ls /var/run/phab_repo_lock* -la [19:46:25] -rw-r--r-- 1 root root 0 Feb 4 07:13 /var/run/phab_repo_lock [19:46:27] -rw-r--r-- 1 root root 0 Feb 4 01:05 /var/run/phab_repo_lock_libext_security [19:46:29] -rw-r--r-- 1 root root 0 Feb 4 01:05 /var/run/phab_repo_lock_libext_Sprint [19:46:44] people touched the lock later [19:47:04] mobrovac: now it's actually merged on master [19:47:21] (03PS1) 10Hashar: all wikis to 1.27.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268452 [19:47:28] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:47:28] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:47:30] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:47:30] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:47:33] mutante: ok, i'll first run puppet in staging and 
confirm it's ok and then to prod [19:47:38] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:47:38] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:47:38] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:47:39] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:47:40] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:47:42] mobrovac: yep :) [19:47:49] PROBLEM - IPsec on cp3020 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:47:49] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:47:49] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:47:49] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:47:49] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:47:49] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:47:49] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:47:50] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:47:50] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:47:51] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:47:51] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:47:52] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: 
kafka1012_v4,kafka1012_v6 [19:47:53] kafka1012 maintenance? [19:47:56] ow ow ow [19:48:03] apergos: it's all just from one server [19:48:09] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:09] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:09] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:09] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:09] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:09] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:09] PROBLEM - IPsec on cp3022 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:10] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:10] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:11] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:13] makes my eyes bleeeed though [19:48:19] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:19] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:19] yes [19:48:19] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:19] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:19] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:19] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: 
kafka1012_v4,kafka1012_v6 [19:48:28] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:29] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:29] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:30] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:30] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:30] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:30] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:38] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:38] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:38] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:38] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:38] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:38] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:39] PROBLEM - IPsec on cp3021 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:48:39] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:40] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:40] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:41] PROBLEM 
- IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:41] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:42] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:42] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:51] apergos: you know what helps? that we fixed the bot being kicked :) [19:48:58] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-1/2/0: down - DISABLEDBR [19:48:58] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:59] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:48:59] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:59] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:59] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:48:59] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:49:00] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:49:00] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:49:01] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 18 not-conn: kafka1012_v4,kafka1012_v6 [19:49:01] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:49:02] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:49:02] PROBLEM - IPsec 
on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1012_v4,kafka1012_v6 [19:49:03] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 36 not-conn: kafka1012_v4,kafka1012_v6 [19:49:05] so it can spam the channel more without dying? [19:49:10] very helpful [19:49:10] yes [19:49:12] hahahaha exactly! [19:49:20] hm [19:49:27] it's time for chocolate cake, enough of this nonsense [19:49:32] also it's almost 10 pm wtf [19:49:40] how am I still in here working away [19:49:45] seems like kafka1012 is in schedule maint in general (for days), and I guess just now someone rebooted it as part of that, but it was up for ipsec purposes before [19:50:08] and I only got one lousy thing done on my todo list [19:50:09] bah [19:50:11] are we ok? [19:50:14] we're fine [19:50:17] (03PS1) 10Chad: Stop using separate file for contribution tracking configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268454 [19:50:20] coolio [19:50:42] mutante: staging good, proceeding to prod [19:50:46] (03CR) 10Alex Monk: [C: 031] "This is going to need a rebase, because wikiversions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [19:50:56] mobrovac: nice [19:53:50] (03CR) 10Jcrespo: [C: 04-2] "https://puppet-compiler.wmflabs.org/1677/ruthenium.eqiad.wmnet/ compiles ok, but the password has not yet been created on production." 
[puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [19:54:19] greg-g, would like to get that wiki creation scheduled, but there's a couple of blockers related to wikidata [19:54:28] !log demon@mira Synchronized private/PrivateSettings.php: (no message) (duration: 02m 05s) [19:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:44] (03PS1) 10Hashar: all wikis to 1.27.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268455 [19:55:19] (03Abandoned) 10Hashar: all wikis to 1.27.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268452 (owner: 10Hashar) [19:57:22] greg-g, actually, just one for this specific creation, and I think I know how to fix it... if not, it can be worked around easily [19:57:52] will go ahead and schedule [19:58:29] Krenair: ok, yeah, let me know if those things aren't easily addressable [19:59:20] PROBLEM - DPKG on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:59:29] PROBLEM - nutcracker port on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:59:54] 6operations, 5Patch-For-Review: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#1999516 (10Dzahn) >>! In T123711#1998201, @elukey wrote: > How does it sound? It sounds good! The wiki page is very detailed and links to exisiting procedures and the plan sounds solid. fw... [20:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160204T2000). [20:00:08] PROBLEM - SSH on terbium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:08] o/ [20:00:09] PROBLEM - nutcracker process on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:09] PROBLEM - configured eth on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:18] terbium? 
[20:00:19] PROBLEM - salt-minion processes on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:24] !log restbase rolling restart after merging https://gerrit.wikimedia.org/r/#/c/268016/ [20:00:27] oh.. looking [20:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:29] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:46] so gotta roll wmf.12 to group2 (aka all the rest of wiki) [20:00:48] PROBLEM - puppet last run on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:53] will hold till terbium is figured out [20:00:58] PROBLEM - dhclient process on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:59] i think it's moritz upgrading ? [20:01:07] that's supposed to be tomorrow? [20:01:12] connects to console [20:01:19] https://wikitech.wikimedia.org/wiki/Deployments#Friday.2C.C2.A0February.C2.A005 [20:01:27] (03PS3) 10Jdlrobson: Make mobile-beta an available platform [puppet] - 10https://gerrit.wikimedia.org/r/268329 [20:01:28] PROBLEM - Disk space on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:01:58] i have a login on serial, it's alive [20:02:00] RECOVERY - SSH on terbium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [20:02:09] RECOVERY - nutcracker process on terbium is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [20:02:09] RECOVERY - configured eth on terbium is OK: OK - interfaces up [20:02:10] RECOVERY - salt-minion processes on terbium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:02:11] eh, didnt do anything yet [20:02:28] RECOVERY - RAID on terbium is OK: OK: optimal, 1 logical, 2 physical [20:02:34] well not "tomorrow", but the scheduled window is ~7h from now [20:02:40] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures [20:02:41] bblack: my tomorrow [20:02:45] i can ssh to it like normal again [20:02:48] RECOVERY - dhclient process on terbium is OK: PROCS OK: 0 processes with command name dhclient [20:02:49] (03CR) 10Thcipriani: [C: 032] all wikis to 1.27.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268455 (owner: 10Hashar) [20:02:58] uptime still 80 days [20:02:59] err wait, 15h from now! [20:03:08] timezones suck [20:03:16] so terbium is just a flap ? 
[20:03:18] RECOVERY - Disk space on terbium is OK: DISK OK [20:03:19] RECOVERY - DPKG on terbium is OK: All packages OK [20:03:19] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268455 (owner: 10Hashar) [20:03:24] hashar: yes [20:03:27] ok proceeding [20:03:29] RECOVERY - nutcracker port on terbium is OK: TCP OK - 0.000 second response time on port 11212 [20:03:37] !log all wikis to 1.27.0-wmf.12 (yeah really) [20:03:38] there was an oomkill there [20:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:51] !log hashar@mira rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.12 [20:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:03] hmm [20:04:16] thcipriani: it is ultra fast :D [20:04:20] Feb 4 11:08:33 terbium kernel: [6922264.565452] php5[8531]: segfault at 0 ip 00007f88d4edb1d6 sp 00007ffc98534c98 error 4 in libc-2.19.so[7f88d4e43000+1bb000] [20:04:23] Feb 4 20:00:57 terbium kernel: [6954231.390890] php5 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [20:04:24] terbium php: #012Fatal error: Failed in afdt_send: Broken pipe [20:05:27] basically, a few other things died from oom-y conditions just before/as some php thing got oomkilled for sucking up all the memory, or something like that [20:05:36] ottomata: there is a foreign cfg on the disk [20:05:42] need to clear that [20:05:47] running puppet in case it needs to restart anything [20:05:48] oh [20:05:49] ok? [20:06:02] > var_dump( class_exists( "CirrusSearch\Maintenance\UpdateSearchIndexConfig" ) ); [20:06:02] bool(true) [20:06:03] > var_dump( class_exists( "Wikibase\PopulateSitesTable" ) ); [20:06:03] bool(false) [20:06:03] hmm [20:06:28] yeah, it's the autoloader [20:06:31] cmjohnson1: it is attempting boot now... [20:07:19] ha, ooook [20:07:20] [ OK ] Mounted /var/spool/kafka/f. 
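The terbium diagnosis above (spotting the segfault and oom-killer entries in the kernel log) can be reproduced with a quick grep. A minimal sketch, using a stand-in log excerpt rather than the host's real /var/log/syslog:

```shell
# Count kernel-log lines that indicate memory trouble. On a real host
# you would grep /var/log/syslog (or `dmesg`) instead of this sample.
cat > /tmp/kern.sample <<'EOF'
Feb  4 11:08:33 terbium kernel: php5[8531]: segfault at 0 ip 00007f88d4edb1d6
Feb  4 20:00:57 terbium kernel: php5 invoked oom-killer: gfp_mask=0x201da
Feb  4 20:01:12 terbium kernel: Out of memory: Kill process 8531 (php5)
EOF
grep -cE 'oom-killer|segfault|Out of memory' /tmp/kern.sample   # → 3
```

This matches the triage done in-channel: the NRPE socket timeouts were a symptom of the box being starved, and the kernel log identified php5 as the process that triggered the oom-killer.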
[20:07:23] dunno how it would do that [20:07:37] bblack: since that host is full of cron jobs, i guess one of them may have triggered it. and i saw extensions/WikimediaMaintenance/getJobQueueLengths.php a few seconds around/before that [20:07:48] and the jobqueue part caught my eye [20:07:55] mutante: RB for ady.wp all done [20:08:01] yeah...that disk needs to have the old cfg cleared first [20:08:04] mobrovac: cool, thank you [20:08:09] PROBLEM - HHVM rendering on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 9.935 second response time [20:08:16] (03CR) 10Dzahn: "restbase server config has been added in prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [20:08:19] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:25] * aude back and assume there are no problems with the train [20:09:19] PROBLEM - salt-minion processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:09:19] PROBLEM - RAID on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:09:19] PROBLEM - HHVM processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:09:30] PROBLEM - nutcracker port on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:09:34] aude: I am looking graphs/stuff no troubles so far [20:09:39] PROBLEM - SSH on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:09:49] PROBLEM - Check size of conntrack table on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:09] PROBLEM - configured eth on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:29] PROBLEM - nutcracker process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:49] PROBLEM - puppet last run on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:10:50] PROBLEM - dhclient process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:58] PROBLEM - DPKG on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:10:59] PROBLEM - Disk space on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:13:01] is mw1117 supposed to be doing that? [20:13:32] not from us [20:13:44] expects that once hhvm kills itself it will come back.. already tries to login. it's not down but ultra busy [20:13:52] login timed out .. hrmm [20:14:13] i'm getting a bit tired of this access stuff ... [20:14:19] depool it maybe? [20:14:22] mutante: mind restart mathoid on scb100x please? [20:14:32] hashar: yay :) [20:14:52] re no troubles [20:15:26] !log scb1001/scb1002 service mathoid restart [20:15:29] mobrovac: done [20:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:33] !log cr1-ulsfo: turning up BGP with Zayo [20:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:13] !log mw1117 - powercycled [20:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:40] hashar: still good on the MW side of things? 
[20:16:49] yeah [20:16:53] greg-g: all fine apparently [20:17:04] but I am waiting for the snowball effect to kick in [20:18:11] (03PS1) 10Addshore: Add $wgWBRepoSettings['sparqlEndpoint'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268467 (https://phabricator.wikimedia.org/T125353) [20:18:19] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 41 minutes ago with 0 failures [20:18:19] RECOVERY - dhclient process on mw1117 is OK: PROCS OK: 0 processes with command name dhclient [20:18:20] RECOVERY - DPKG on mw1117 is OK: All packages OK [20:18:28] RECOVERY - Disk space on mw1117 is OK: DISK OK [20:18:40] RECOVERY - RAID on mw1117 is OK: OK: no RAID installed [20:18:40] RECOVERY - salt-minion processes on mw1117 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:18:40] RECOVERY - HHVM processes on mw1117 is OK: PROCS OK: 6 processes with command name hhvm [20:18:58] RECOVERY - nutcracker port on mw1117 is OK: TCP OK - 0.000 second response time on port 11212 [20:18:59] RECOVERY - SSH on mw1117 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [20:19:10] RECOVERY - Check size of conntrack table on mw1117 is OK: OK: nf_conntrack is 1 % full [20:19:28] RECOVERY - configured eth on mw1117 is OK: OK - interfaces up [20:19:29] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 65511 bytes in 1.008 second response time [20:19:39] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.178 second response time [20:19:46] still 93 CRITs..sigh [20:19:51] RECOVERY - nutcracker process on mw1117 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:20:37] (03CR) 10Mobrovac: [C: 031] ruthenium: Some more tweaks to parsoid + visualdiffing services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268326 (owner: 10Subramanya Sastry) [20:20:52] 6operations, 10ops-codfw, 10procurement: ulsfo: AA 
batteries - https://phabricator.wikimedia.org/T125873#1999599 (10RobH) 3NEW a:3RobH [20:22:14] (03PS1) 10Dereckson: Add \n to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 [20:22:49] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 65517 bytes in 1.156 second response time [20:22:50] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.076 second response time [20:23:06] 6operations, 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 7Monitoring: Ensure mysql credential creation for tools users is running - https://phabricator.wikimedia.org/T125874#1999618 (10Dzahn) 3NEW [20:23:06] !log mw1115 service hhvm restart [20:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:37] ACKNOWLEDGEMENT - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed daniel_zahn https://phabricator.wikimedia.org/T125874 [20:25:33] (03CR) 10Hashar: "I noticed that when manually editing with vim (which adds a newline at end of file). Looks fine to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 (owner: 10Dereckson) [20:25:40] (03CR) 10Hashar: [C: 031] Add \n to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 (owner: 10Dereckson) [20:26:50] !log All wikis to 1.27.0-wmf.12 No troubles so far congratulations to everyone involved @wikimedia #wikimedia [20:26:50] !log eeden service ntp restart [20:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:42] hashar: hash tags in !log work because it still gets on actual twitter ? [20:27:53] hashar, ... ew, can we not put twitter stuff in SAL please? 
[20:27:54] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#1999652 (10Krinkle) [20:27:58] mutante: I am assuming they are still relaying to twitter yeah [20:28:51] why not put the hash [20:28:57] it's not like we're putting some random tag in there [20:30:55] !log cr1-ulsfo: deactivating BGP peering with GTT [20:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:29] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 20 ESP OK [20:31:29] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 20 ESP OK [20:31:29] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 38 ESP OK [20:31:29] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 20 ESP OK [20:31:29] RECOVERY - IPsec on cp3019 is OK: Strongswan OK - 20 ESP OK [20:31:29] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 38 ESP OK [20:31:30] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 28 ESP OK [20:31:30] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 38 ESP OK [20:31:31] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 38 ESP OK [20:31:31] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [20:31:32] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 38 ESP OK [20:31:39] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 38 ESP OK [20:31:40] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 28 ESP OK [20:31:40] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 28 ESP OK [20:31:40] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 20 ESP OK [20:31:40] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 28 ESP OK [20:31:40] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 38 ESP OK [20:31:40] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 38 ESP OK [20:31:49] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 38 ESP OK [20:31:49] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 20 ESP OK 
[20:31:59] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 38 ESP OK [20:32:00] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 20 ESP OK [20:32:00] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 28 ESP OK [20:32:00] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 38 ESP OK [20:32:03] apergos: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160204-Phabricator updated, I think all the details are there now, mystery solved ... [20:32:08] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 20 ESP OK [20:32:09] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 28 ESP OK [20:32:17] apergos: thanks for deducing the root cause [20:32:18] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 38 ESP OK [20:32:19] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 28 ESP OK [20:32:19] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 38 ESP OK [20:32:19] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 28 ESP OK [20:32:19] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 28 ESP OK [20:32:19] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 28 ESP OK [20:32:19] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 38 ESP OK [20:32:20] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 28 ESP OK [20:32:20] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 28 ESP OK [20:32:21] RECOVERY - IPsec on cp3020 is OK: Strongswan OK - 20 ESP OK [20:32:21] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 38 ESP OK [20:32:22] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [20:32:22] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 20 ESP OK [20:32:28] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 38 ESP OK [20:32:30] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 20 ESP OK [20:32:30] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 28 ESP OK [20:32:30] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 20 ESP OK [20:32:30] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 28 ESP OK [20:32:38] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 20 ESP OK 
[20:32:38] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 38 ESP OK [20:32:39] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 38 ESP OK [20:32:39] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 28 ESP OK [20:32:39] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - 28 ESP OK [20:32:39] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 28 ESP OK [20:32:40] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 28 ESP OK [20:32:40] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 20 ESP OK [20:32:40] RECOVERY - IPsec on cp4002 is OK: Strongswan OK - 20 ESP OK [20:32:40] RECOVERY - IPsec on cp3022 is OK: Strongswan OK - 20 ESP OK [20:32:41] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 20 ESP OK [20:32:46] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1999660 (10greg) Terbium is scheduled for a reboot tonight (3am Pacific, 11am UTC): https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160205T1100 Shou... 
[20:32:48] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 38 ESP OK [20:32:48] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 20 ESP OK [20:32:48] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 28 ESP OK [20:32:49] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 38 ESP OK [20:32:58] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 28 ESP OK [20:32:59] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 38 ESP OK [20:32:59] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 28 ESP OK [20:32:59] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 20 ESP OK [20:33:00] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 20 ESP OK [20:33:08] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 28 ESP OK [20:33:08] RECOVERY - IPsec on cp4003 is OK: Strongswan OK - 20 ESP OK [20:33:09] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 20 ESP OK [20:33:09] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 28 ESP OK [20:33:09] RECOVERY - IPsec on cp4004 is OK: Strongswan OK - 20 ESP OK [20:33:09] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 38 ESP OK [20:33:09] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 28 ESP OK [20:33:10] RECOVERY - IPsec on cp3021 is OK: Strongswan OK - 20 ESP OK [20:33:10] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 28 ESP OK [20:33:11] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 38 ESP OK [20:33:11] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 38 ESP OK [20:33:12] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK [20:33:12] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 38 ESP OK [20:33:13] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 38 ESP OK [20:33:14] (03CR) 10Thcipriani: [C: 04-1] Add \n to wikiversions.json (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 (owner: 10Dereckson) [20:33:28] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 75, down: 0, dormant: 0, excluded: 1, unused: 0 [20:33:31] 7Blocked-on-Operations, 6operations, 
10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1999663 (10greg) Adding @MoritzMuehlenhoff regarding the reboot [20:33:49] twentyafterfour: happy to help! [20:34:56] (03CR) 10Dzahn: [C: 031] Add access to m5-master:testreduce* dbs for ssastry on ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/268438 (https://phabricator.wikimedia.org/T125435) (owner: 10Jcrespo) [20:34:59] RECOVERY - NTP on eeden is OK: NTP OK: Offset -0.001519918442 secs [20:37:02] (03PS1) 10Aude: Enable ArticlePlaceholder extension in beta only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268471 [20:38:27] (03PS1) 10Aude: Remove WB_EXPERIMENTAL_FEATURES (was labs only) setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268472 [20:38:31] (03PS2) 10Dereckson: Add \n to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 [20:41:21] thcipriani: indeed, if not that would be "0" in the JSON file, thanks to have catch that [20:42:21] Dereckson: sure thing :) [20:42:45] 10Ops-Access-Requests, 6operations: Allow mobrovac to start/stop/restart services on SCx - https://phabricator.wikimedia.org/T125879#1999722 (10mobrovac) @Gwicke please approve. [20:45:37] !log rsc1001 - schedule downtime, reboot [20:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:35] (03CR) 10Hashar: Add \n to wikiversions.json (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 (owner: 10Dereckson) [20:56:09] (03CR) 10Hashar: [C: 031] Add \n to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268468 (owner: 10Dereckson) [21:00:01] argh.... the next swat is already full :/ [21:00:22] 6operations, 10DBA, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1831329 (10jcrespo) [21:00:48] greg-g: would it be ok if i deploy https://gerrit.wikimedia.org/r/#/c/268471/ ? 
(enable ArticlePlaceholder on beta only) [21:01:16] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1999848 (10Anomie) Now that wmf.12 is everywhere, an alternative would be to just kill the job and set `$wgAuthenticationTokenVersion` to a non-null value in our conf... [21:01:19] would want to sync this also (assume on mira?) [21:01:34] my other patches can wait until monday [21:02:48] 6operations, 10DBA: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#1999858 (10jcrespo) [21:03:02] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Setup circular replication (dallas -> eqiad) for databases - https://phabricator.wikimedia.org/T124698#1999863 (10jcrespo) [21:03:04] 6operations, 10DBA: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#1831941 (10jcrespo) [21:03:33] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#1999867 (10jcrespo) [21:03:49] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#1999868 (10jcrespo) a:3jcrespo [21:08:45] 6operations, 10MediaWiki-API, 6Services, 10Traffic, 7Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1999899 (10GWicke) Code behind the current metrics is at 
https://github.com/wikimedia/operations-puppet/blob/c62c102e8/modules/varnish/files... [21:11:20] What's up with redis? [21:11:32] Oh rcs, gotcha, misread. [21:11:38] nvm, expected. [21:12:43] 6operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#1999935 (10Krinkle) Deployment strategy: 1. [x] [mediawiki/core] Change MediaWiki... [21:15:09] !log rcs1002 - reboot for kernel [21:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:30] (03CR) 10Hashar: [C: 031] "Lets make it better. Feel free to cherry pick this on the beta cluster puppet master instance ( deployment-puppetmaster.deployment-prep.eq" [puppet] - 10https://gerrit.wikimedia.org/r/268022 (owner: 10Tim Starling) [21:16:59] (03PS1) 10EBernhardson: Better mediawiki REPL [puppet] - 10https://gerrit.wikimedia.org/r/268541 [21:17:43] ori: ^^ [21:17:51] omg [21:19:09] !log rcs1002 - start redis [21:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:20] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [21:19:38] that should be over , like now [21:19:48] i had to start the service myself [21:19:50] PROBLEM - PyBal backends health check on lvs1008 is CRITICAL: PYBAL CRITICAL - streamlb6_443 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1001.eqiad.wmnet because of too many down! 
[21:19:59] PROBLEM - PyBal backends health check on lvs1011 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb_80 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1001.eqiad.wmnet because of too many down! [21:20:08] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - streamlb_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down! [21:20:09] mutante: it's saying 1001 wants to be depooled too [21:20:26] it shouldn't be :/ [21:20:56] the service is running.. hrmm [21:21:10] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [21:21:33] !log setting up OSPF/OSPF3/PIM between ulsfo and codfw (cr2-ulsfo/cr1-codfw) [21:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:21:39] RECOVERY - PyBal backends health check on lvs1008 is OK: PYBAL OK - All pools are healthy [21:21:50] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [21:21:59] phhew, ok [21:23:40] definitely spammed logstash. [21:23:47] :) [21:26:14] (03PS6) 10MarcoAurelio: Enabling Extension:ShortUrl on or.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) [21:27:09] RECOVERY - PyBal backends health check on lvs1011 is OK: PYBAL OK - All pools are healthy [21:28:40] mutante: I'm still getting a ton of "cannot connect to redis" on rcs1001 [21:28:46] aude: do it, and yeah, mira [21:28:55] greg-g: ok [21:29:35] !log rcs1001 started redis [21:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:29:44] ostriches: indeed it was not running again .. uhmm..
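The "Could not depool server ... because of too many down!" alerts above come from PyBal's depool-threshold safeguard: it refuses to remove a failing backend if doing so would leave too little of the pool serving traffic. A minimal sketch of that decision; the pool size, threshold, and wording are illustrative, not PyBal's actual code:

```shell
# With both rcs100x backends failing health checks, depooling either
# one would empty the pool, so PyBal keeps them pooled and alerts.
POOL_SIZE=2; DOWN=2; THRESHOLD_PCT=50   # illustrative numbers

remaining=$(( POOL_SIZE - DOWN ))
if [ $(( remaining * 100 )) -lt $(( POOL_SIZE * THRESHOLD_PCT )) ]; then
    echo "could not depool: too many down"
else
    echo "depooled"
fi
```

That is why the alerts cleared on their own once redis was started on each host: as soon as one backend passed its health checks again, the remaining failures were below the threshold.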
[21:30:08] redis-server is running [21:30:29] Thereeee it goes [21:30:33] MW finally caught up [21:31:45] (03CR) 10Chad: [C: 032] Stop using separate file for contribution tracking configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268454 (owner: 10Chad) [21:31:46] luckily the clients were on rcs1002 meanwhile. jdl_robson confirmed his service survived this [21:32:06] (03CR) 10Aude: [C: 032] Enable ArticlePlaceholder extension in beta only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268471 (owner: 10Aude) [21:32:12] mutante: I wonder if MW wasn't using the lvs'd one somewhere? [21:32:14] ah [21:32:16] * ostriches has no idea how this works [21:32:21] So guessing! [21:32:30] ostriches: do you want to deploy my patch? [21:32:35] link? [21:32:40] Oh that one ^? [21:32:42] i just +2'd it [21:33:02] ok, you are touching different files [21:33:09] i can do mine if you like, after you [21:33:15] I can do it together nbd. [21:33:20] ok [21:33:24] thanks :) [21:33:25] (03Merged) 10jenkins-bot: Stop using separate file for contribution tracking configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268454 (owner: 10Chad) [21:33:45] ostriches: it stopped again .. what the .. [21:34:03] Boom there it went [21:34:04] ... [21:34:06] (03Merged) 10jenkins-bot: Enable ArticlePlaceholder extension in beta only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268471 (owner: 10Aude) [21:34:14] ori: somehow the redis servers keep stopping on those rcs boxes [21:34:20] aude: Imma wait until we figure out what's going on with this first tho. [21:34:24] ok [21:34:24] mutante: both? [21:34:24] i'm double checking puppet does it [21:34:25] no hurry [21:34:27] yes [21:34:37] after a while they are stopped again [21:34:37] mutante: ok, which one can I look at? 
[21:34:46] ori: rcs1002 [21:34:56] i'm taking the other one [21:35:25] started it, running puppet [21:35:36] the main redis-server service shouldn't run since we went multi-instance [21:35:53] oh, but ostriches saw the connection error in logs [21:36:01] and they stopped when i started the main service [21:36:14] Notice: /Stage[main]/Redis/Service[redis-server]/ensure: ensure changed 'running' to 'stopped' [21:36:23] https://logstash.wikimedia.org/#dashboard/temp/AVKuNud2ptxhN1XaWWkL [21:36:30] yeah, because both it and the tcp_6379 instance want port 6379 [21:36:42] it is not correct to start redis-server, puppet is doing the right thing [21:36:58] redis is up [21:37:00] on rcs1002 [21:37:03] and rcstream too [21:37:11] i did not do anything [21:37:26] (03PS1) 10Milimetric: [WIP] Update AQS config with new syntax [puppet] - 10https://gerrit.wikimedia.org/r/268560 (https://phabricator.wikimedia.org/T122249) [21:37:30] I've only seen rcs1001 complaining in MW [21:37:33] *logstash [21:37:39] rcstreamstatus also says its ok on rcs1001 [21:37:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:37:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:37:59] no, it's not [21:38:22] i started it on rcs1001 [21:38:25] and it was a mistake [21:38:27] so, i am editing that line out [21:38:27] 12:59 it will not be restored by puppet [21:38:28] 12:59 thanks [21:38:29] 12:59 and then will start redis-server [21:38:31] 12:59 > redis-server is running [21:38:38] (03PS5) 10Dereckson: Initial configuration for ady.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) [21:38:38] redis-server should not be running, it's the instance-specific service that should be [21:38:43] puppet is right, I'm wrong [21:38:49] ok, *nod* [21:38:57] i stopped the redis-server, 
running puppet [21:39:04] that should do it, yeah [21:39:07] checking rcstreamctl [21:39:14] it would have stopped redis-server anyhow, but yeah [21:39:18] i basically sabotaged rcs1001 :/ [21:39:24] it's still "start/post-stop" not "start/running"? [21:40:28] (03CR) 10Dereckson: "PS5: rebased, updated wikiversions.json to use 1.27.0-wmf.12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268004 (https://phabricator.wikimedia.org/T125501) (owner: 10Dereckson) [21:41:01] I'm taking a look. MediaWiki writes to both servers independently, by the way, so rcstream itself should be up [21:41:07] since rcs1002 is [21:41:25] kernel: [ 3168.131961] init: rcstream/server (127.0.0.1:10100) main process ended, respawning [21:41:34] because redis is down [21:41:39] but puppet is not starting it [21:42:15] what's the right redis service called? [21:42:32] redis-instance-tcp_6379 [21:42:38] which upstart says is running [21:42:39] but it lies [21:42:46] it must not be tracking the process correctly [21:43:13] ok, I know what is going on [21:44:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:44:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:47:40] mutante: here's what happened: [21:49:55] Can I get a root to `chgrp -R wikidev /srv/mediawiki-staging/.git/objects` on mira? Some objects got owned by root [21:50:26] ostriches: yea [21:50:27] - Upstart started redis-instance-tcp_6379. It expected it to fork, which it did not, because: redis.log:[4154] 04 Feb 20:49:38.897 # Can't chdir to '/var/run/redis': No such file or directory [21:51:10] ori: aaah :) so you created /var/run/redis just now?
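[As an aside on the [21:49:55] request above: a minimal sketch of the two-step repair for root-owned git objects under /srv/mediawiki-staging/.git/objects. It is simulated in a temp dir; the wikidev group and a root shell are assumed in production, so the chgrp step is shown commented out.]

```shell
# Stand-in tree for /srv/mediawiki-staging, with one "object" whose
# group bits were lost (as happens when root writes into .git/objects).
STAGING=$(mktemp -d)
mkdir -p "$STAGING/.git/objects/ab"
touch "$STAGING/.git/objects/ab/cdef0123"
chmod -R g-w "$STAGING/.git/objects"        # simulate the broken state

# chgrp -R wikidev "$STAGING/.git/objects"  # step 1 in production (needs root)
chmod -R g+w "$STAGING/.git/objects"        # step 2: chmod, not chown/chgrp
stat -c %A "$STAGING/.git/objects/ab/cdef0123"
```

[For a shared deploy tree, setting git's `core.sharedRepository=group` would make git create group-writable objects in the first place, reducing the need for after-the-fact repairs.]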
[21:51:31] - For that reason (see for some context), Upstart was not correctly tracking the process, and upstart kept reporting that redis-instance-tcp_6379 is running [21:51:43] - Because it was running, Puppet had no reason to want to restart it [21:52:23] !log mira chgrp -R wikidev /srv/mediawiki-staging/.git/objects [21:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:52:27] I did not create it [21:52:30] it gets more complicated [21:52:52] heh, ok :) [21:52:56] when I puppetized the redis instance thing, I must have assumed /var/run/redis was created by the package [21:53:04] because it was on all the hosts and it was not in puppet [21:53:21] it is in fact created by /etc/init.d/redis-server [21:53:26] mutante: Bleh, not enough, they're not group writable. `chown -R g+w /srv/mediawiki-staging/.git/objects` [21:53:26] which we are no longer using [21:53:34] root stole the hell out of those couple of directories :p [21:53:56] ori: *nod* [21:54:08] you or me tried to start redis-server, which then bound port 6379, but also created /var/run [21:54:12] aude: Deploying when root gives me back mw-staging again :p [21:54:38] when it got killed, and i ran 'service redis-instance-tcp_6379 start', it actually started again, because /var/run/redis was now there [21:54:51] so here's what I'd like to do, let me know what you think: [21:54:57] - puppetize /var/run/redis [21:55:07] ostriches: chmod, not chown, done [21:55:08] - reboot rcs1001 one more time, to verify that this issue won't bite us next time [21:55:16] ty! [21:55:24] ostriches: ok [21:55:24] rcstream service should remain up throughout based on rcs1002 being live. [21:55:38] suppose it's already on beta now [21:55:39] sounds good? 
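[The failure mode ori reconstructs above can be reproduced in miniature: the daemon dies at startup because its run directory is missing, which is exactly what puppetizing /var/run/redis prevents. `fake_daemon` and `RUNDIR` are hypothetical stand-ins for redis-server and /var/run/redis.]

```shell
RUNDIR="$(mktemp -d)/redis"       # stand-in for /var/run/redis (not yet created)

fake_daemon() {
    # mimics redis.log: Can't chdir to '/var/run/redis': No such file or directory
    cd "$RUNDIR" || return 1
    echo "started in $PWD"
}

fake_daemon && echo up || echo "failed: run dir missing"

mkdir -p "$RUNDIR"                # what puppetizing /var/run/redis guarantees
fake_daemon && echo up
```

[The second, subtler half of the incident is that Upstart's `expect fork` tracking never saw the expected fork, so it kept reporting the dead job as running and puppet had no reason to restart it: the process supervisor's view, not the process, was wrong.]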
[21:55:39] * aude checks [21:55:53] ori: yes, sounds good, let's do that [21:56:02] ok, i'll have a puppet patch for you in a sec [21:56:11] (thanks, by the way) [21:56:21] 'k, np [21:57:46] !log demon@mira Synchronized wmf-config/: gerrit 268471, 268454 (duration: 01m 18s) [21:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:57:50] aude: ^^ [21:58:52] thanks [21:59:42] doesn't appear synced yet on beta (checking on deployment-mediawiki02) [22:00:58] aude: looks like the beta cluster scap is still running -- https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/88638/console [22:01:10] !log demon@mira Synchronized wmf-config/PrivateSettings.php: touch symlink (duration: 01m 15s) [22:01:13] yeah [22:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:01:53] It is not undefined you stupid... [22:10:48] !log demon@mira Synchronized private/: touch (duration: 01m 15s) [22:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:11:48] (03PS4) 10Tim Starling: For all apache access logs, use the WMF cache log format [puppet] - 10https://gerrit.wikimedia.org/r/268022 [22:12:02] !log demon@mira Synchronized wmf-config/: touch (duration: 01m 14s) [22:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:14:41] When in doubt, touch all the config files [22:16:39] bd808: thanks for that request for help testing sm email [22:17:22] It's. Not. Undefined. [22:17:23] (03CR) 10Mobrovac: [C: 04-1] "LGTM. Just need to append the following line to the end of the config:" [puppet] - 10https://gerrit.wikimedia.org/r/268560 (https://phabricator.wikimedia.org/T122249) (owner: 10Milimetric) [22:17:28] How is this variable undefined? [22:18:10] PrivateSettings is included near the top of CommonSettings and all its variables are in scope. Why is the one I'm introducing undefined? [22:18:15] It's there. 
It's totally there. [22:19:39] a room rejected my invitation today, so now i know we live in a world where anything is possible [22:20:10] ostriches: the error rate is pretty low for it to really be undef. Looks like local cache on a few hosts? [22:20:18] A fair bunch.... [22:21:20] most only have 1 error in the last 5 mins [22:22:29] Well it's only enabled on 3 wikis [22:22:34] foundation, donate and test. [22:23:00] the client side is supposed to touch commonsettings on every sync to bust the local cache but I know we've seen it race before [22:23:03] (03PS3) 10Tim Starling: parsoid-rt-client: Have testreduce clients use global parsoid service [puppet] - 10https://gerrit.wikimedia.org/r/268230 (owner: 10Subramanya Sastry) [22:23:09] (03CR) 10Tim Starling: [C: 032] parsoid-rt-client: Have testreduce clients use global parsoid service [puppet] - 10https://gerrit.wikimedia.org/r/268230 (owner: 10Subramanya Sastry) [22:25:12] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2000289 (10greg) >>! In T124440#1999848, @Anomie wrote: > Unless we think we might need to roll back to wmf.10 again, anyway. To act in an abundance of caution, I wo...
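[The "touch commonsettings to bust the local cache" mechanism mentioned at [22:23:00] relies on mtime: PHP's opcode cache revalidates a file when its timestamp changes, so the deploy tooling bumps the entry-point config after every sync. A minimal sketch, with `CONF` as a stand-in for wmf-config/CommonSettings.php:]

```shell
CONF=$(mktemp)                  # stand-in for the synced config file
before=$(stat -c %Y "$CONF")    # mtime as epoch seconds

sleep 1                         # ensure a visible timestamp difference
touch "$CONF"                   # the post-sync cache-bust step

after=$(stat -c %Y "$CONF")
[ "$after" -gt "$before" ] && echo "mtime bumped; caches will revalidate"
```

[If the touch races with a request that has already stat-cached the old file, a host can briefly serve stale config, which matches the low-rate "undefined variable" errors seen above.]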
[22:26:53] (03PS1) 10Dereckson: Namespaces configuration on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268573 (https://phabricator.wikimedia.org/T125801) [22:27:03] Dereckson, https://phabricator.wikimedia.org/T69223#2000291 [22:27:18] (03PS1) 10Chad: Revert "Stop using separate file for contribution tracking configuration" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268574 [22:27:30] (03CR) 10Chad: [C: 032 V: 032] "this is dumb" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268574 (owner: 10Chad) [22:27:47] (03PS5) 10Tim Starling: ruthenium: Some more tweaks to parsoid + visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/268326 (owner: 10Subramanya Sastry) [22:27:55] (03CR) 10Tim Starling: [C: 032] ruthenium: Some more tweaks to parsoid + visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/268326 (owner: 10Subramanya Sastry) [22:29:14] jynus: so it's not Blocked-on-schema-change anymore? [22:29:27] you can leave it there, no need to spam [22:30:13] !log demon@mira Synchronized wmf-config/: undo my cleanup grumble grumble (duration: 01m 16s) [22:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:31:14] mobrovac, what do you mean a room rejected your invitation? [22:31:48] Dereckson, technically, it is "pending to test", but the problem is when someone is waiting for me, but I do not know because it doesn't have the tag or I wasn't pinged- it is now in the loop [22:32:00] Krenair: i was reserving a room here at the office for a meeting and got a mail "R32 rejected your invitation" :D [22:32:15] (03CR) 10Legoktm: "What went wrong?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/268574 (owner: 10Chad) [22:32:21] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2000319 (10csteipp) > It probably doesn't matter if we do it in CommonSettings.php or in the private settings since that's the "public" part of the hmac, but ask @cst... [22:32:22] mobrovac, oh. did someone else reserve it for that time? [22:32:42] yup, but it was sorted out eventually [22:32:54] it was just funny to receive a rejection from a room as if it were a person [22:33:07] 'twas my fault [22:33:19] I didn't drop the room even though I didn't commute in today [22:34:10] bd808: Well that was dumb. I'll mess with it again later. [22:36:20] (03PS1) 10Mobrovac: RESTBase: Labs: Use RB 0.10.x config style [puppet] - 10https://gerrit.wikimedia.org/r/268575 [22:36:33] mutante: ^^^ [22:38:32] (03PS2) 10Dzahn: RESTBase: Labs: Use RB 0.10.x config style [puppet] - 10https://gerrit.wikimedia.org/r/268575 (owner: 10Mobrovac) [22:39:52] (03CR) 10Dzahn: [C: 032] "confirmed base_path on cerium" [puppet] - 10https://gerrit.wikimedia.org/r/268575 (owner: 10Mobrovac) [22:41:15] (03PS3) 10Dzahn: grafana: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266978 [22:41:26] (03PS1) 10Krinkle: mediawiki: Clean up beta sites Apache configs [puppet] - 10https://gerrit.wikimedia.org/r/268578 [22:41:49] (03CR) 10Dzahn: [C: 032] grafana: fix top-scope var without namespace [puppet] - 10https://gerrit.wikimedia.org/r/266978 (owner: 10Dzahn) [22:44:28] (03PS1) 10Ori.livneh: Write redis's pid file to /var/lib/redis by default. [puppet] - 10https://gerrit.wikimedia.org/r/268580 [22:44:31] mutante: ^ [22:44:35] see commit msg for rationale [22:45:11] !log rebooting cp1060 to test traffic-pool stuff [22:45:12] ori: ok! 
also see the grafana thing above, i added that ensure to the parameter [22:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:45:41] mutante: $ensure should always go first :) [22:46:00] mutante: https://docs.puppetlabs.com/guides/style_guide.html#attribute-ordering [22:46:37] ori: yea, should have known, fixed a lot of those lint warnings :p [22:47:17] (03CR) 10Dzahn: [C: 031] Add missing & (typo?) [puppet] - 10https://gerrit.wikimedia.org/r/268341 (owner: 1020after4) [22:49:32] ori: i have to go, i will get on it as soon as i'm back [22:50:13] mutante: I'll apply it, since it's a trap [22:50:18] if you don't mind [22:51:53] it works! [22:52:07] it..? [22:52:51] sorry I'm in my own world [22:53:28] automatic depool of cpNNNN services if any daemon dies or the machine gets shutdown/reboot [22:53:30] (03PS2) 10Ori.livneh: Write redis's pid file to /var/lib/redis by default. [puppet] - 10https://gerrit.wikimedia.org/r/268580 [22:53:45] (03CR) 10Ori.livneh: [C: 032 V: 032] Write redis's pid file to /var/lib/redis by default. [puppet] - 10https://gerrit.wikimedia.org/r/268580 (owner: 10Ori.livneh) [22:53:50] (03PS1) 10Subramanya Sastry: parsoid-rt-client: parsoidConfig reqd even if using a global parsoid svc [puppet] - 10https://gerrit.wikimedia.org/r/268584 [22:53:54] + automatic repool after all services are back up on boot, if admin requested it via "touch /etc/traffic-pool-once" before reboot [22:54:47] 10Ops-Access-Requests, 6operations: access to Gerrit for Legal Contractor - https://phabricator.wikimedia.org/T125908#2000370 (10eliza) 3NEW [22:55:07] (03CR) 10Subramanya Sastry: "Verified on ruthenium that this fixes the broken rt-testing clients." 
[puppet] - 10https://gerrit.wikimedia.org/r/268584 (owner: 10Subramanya Sastry) [22:55:30] (03PS7) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [22:55:47] (03PS2) 10Milimetric: Update AQS config with new syntax [puppet] - 10https://gerrit.wikimedia.org/r/268560 (https://phabricator.wikimedia.org/T122249) [22:55:50] bblack: why /etc and not, say, /var/run? [22:56:11] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2000387 (10matmarex) >>! In T124440#1999848, @Anomie wrote: > Now that wmf.12 is everywhere, an alternative would be to just kill the job and set `$wgAuthenticationTo... [22:56:21] well not /var/run because /var/run doesn't survive a reboot, and the whole purpose of that file is to survive a reboot [22:56:40] it could be elsewhere in /var, though [22:56:59] either way, it's a one-shot existence flag that survives one reboot, and gets removed by the script when consumed [22:57:06] oh, duh, right. [22:58:18] now I have to go look at the FHS doc again and see if it has useful things to say heh. it's been a while [22:58:49] 10Ops-Access-Requests, 6operations: access to Gerrit for Legal Contractor - https://phabricator.wikimedia.org/T125908#2000389 (10Krenair) 5Open>3Invalid a:3Krenair It's not a restricted system, anyone can get a gerrit account. Once you have a username, a gerrit admin (including but not restricted to ops)... 
[22:59:00] probably /var/lib/ [22:59:46] 10Ops-Access-Requests, 6operations: access to Gerrit for Legal Contractor - https://phabricator.wikimedia.org/T125908#2000398 (10Krenair) https://www.mediawiki.org/wiki/Developer_access [23:00:05] touch /var/lib/traffic-pool/pool-once would be more FHS-correct yeah [23:00:32] 10Ops-Access-Requests, 6operations: access to Gerrit for Legal Contractor - https://phabricator.wikimedia.org/T125908#2000402 (10Peachey88) The contractor will need to a [[ https://wikitech.wikimedia.org/wiki/Special:UserLogin/signup | Wikitech ]] account as a start [23:03:37] (03CR) 10Ori.livneh: "Great stuff. The macro trick doesn't actually work for me, presumably because macros are saved to ~/.hphpd.ini by default, and we sudo to " [puppet] - 10https://gerrit.wikimedia.org/r/268541 (owner: 10EBernhardson) [23:04:04] (03PS3) 10Ottomata: [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) [23:04:06] (03PS2) 10Tim Starling: parsoid-rt-client: parsoidConfig reqd even if using a global parsoid svc [puppet] - 10https://gerrit.wikimedia.org/r/268584 (owner: 10Subramanya Sastry) [23:04:10] (03CR) 10Tim Starling: [C: 032] parsoid-rt-client: parsoidConfig reqd even if using a global parsoid svc [puppet] - 10https://gerrit.wikimedia.org/r/268584 (owner: 10Subramanya Sastry) [23:04:52] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#2000416 (10Legoktm) About 30 minutes ago it was at gu_id 10467742. There are 45463166 total users. 
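[The one-shot flag bblack and ori converge on above follows a simple consume-once pattern, placed under /var/lib precisely because /var/run is cleared at boot and the flag must survive exactly one reboot. A sketch in a temp dir; the real path discussed is /var/lib/traffic-pool/pool-once:]

```shell
STATEDIR=$(mktemp -d)           # stand-in for /var/lib/traffic-pool
FLAG="$STATEDIR/pool-once"

touch "$FLAG"                   # admin opts in to repool before rebooting

# On boot the service consumes the flag, so it fires at most once:
if [ -e "$FLAG" ]; then
    rm -f "$FLAG"               # remove first, then act, to avoid re-triggering
    echo "repooling after reboot"
else
    echo "no repool requested"
fi
```

[Removing the flag before acting on it means a crash mid-repool fails safe: the next boot stays depooled rather than repooling a half-recovered host.]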
[23:05:22] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Refactor manifests/role/analytics/* into modules/role, use hiera to configure [puppet] - 10https://gerrit.wikimedia.org/r/267797 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [23:06:20] (03PS8) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [23:06:45] (03PS2) 10MarcoAurelio: Adding museumvictoria.com.au domain to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267677 (https://phabricator.wikimedia.org/T125387) [23:07:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:07:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:08:50] (03PS1) 10Mobrovac: RESTBase: Labs: Use relative module path [puppet] - 10https://gerrit.wikimedia.org/r/268587 [23:09:08] (03PS9) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [23:09:18] (03PS2) 10Mobrovac: RESTBase: Labs: Use relative module path [puppet] - 10https://gerrit.wikimedia.org/r/268587 [23:11:39] (03PS1) 10Chad: Xrumer antibot filters not deployed anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268590 [23:12:08] (03PS2) 10MarcoAurelio: Set $wgEnotifMinorEdits = true on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267558 (https://phabricator.wikimedia.org/T125351) [23:12:32] (03PS7) 10MarcoAurelio: Enabling Extension:ShortUrl on or.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) [23:12:42] (03PS2) 10MarcoAurelio: Enabling Extension:ShortUrl for bhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267783 (https://phabricator.wikimedia.org/T113348) [23:12:53] (03CR) 10Chad: [C: 032] Xrumer antibot filters not deployed anymore [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/268590 (owner: 10Chad) [23:12:55] (03PS2) 10MarcoAurelio: Removing testwiki from wmgUseShortUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267784 [23:13:18] (03PS2) 10MarcoAurelio: Enabling Ext:ShortURL for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268414 (https://phabricator.wikimedia.org/T125802) [23:13:40] (03CR) 10Chad: [V: 032] Xrumer antibot filters not deployed anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268590 (owner: 10Chad) [23:14:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:14:39] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:15:00] (03CR) 10Mobrovac: "Reverted in I208df45e2537d2f19a7532b0f37f897bf6f9ce53" [puppet] - 10https://gerrit.wikimedia.org/r/267917 (owner: 10Mobrovac) [23:15:31] (03PS4) 10Ori.livneh: Make mobile-beta an available platform [puppet] - 10https://gerrit.wikimedia.org/r/268329 (owner: 10Jdlrobson) [23:15:37] (03CR) 10Ori.livneh: [C: 032 V: 032] Make mobile-beta an available platform [puppet] - 10https://gerrit.wikimedia.org/r/268329 (owner: 10Jdlrobson) [23:18:10] 6operations, 10OTRS: Upload AgentLoginLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125911#2000447 (10Rjd0060) 3NEW [23:18:22] 10Ops-Access-Requests, 6operations: access to Gerrit for Legal Contractor - https://phabricator.wikimedia.org/T125908#2000463 (10eliza) Thank you Krenair and Peachy88! [23:18:59] Faillllllllllll. 
[23:19:23] 6operations, 10ops-codfw, 5Patch-For-Review: onsite setup for sarin (WMF5851) - https://phabricator.wikimedia.org/T125753#2000478 (10RobH) 5Open>3Resolved [23:19:25] 6operations, 10Salt: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#2000479 (10RobH) [23:21:30] 6operations, 10OTRS: Upload AgentLogo file to OTRS skins directory - https://phabricator.wikimedia.org/T125912#2000491 (10Krenair) [23:24:52] RoanKattouw, Krenair and ostriches - WRT shorturl patches deployments scheduled for today, I was told to tell you that before merge, shorturl tables have to be added to the DBs of those (small) wikis [23:25:28] if it's too much work, I can remove those from the deployment calendar and add them at a later date [23:26:04] ori: don't mind at all, also back now [23:26:17] (03CR) 10MarcoAurelio: "Needs schema change (shorturl tables added to the project) before merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267780 (https://phabricator.wikimedia.org/T124429) (owner: 10MarcoAurelio) [23:26:20] mutante: shall we reboot rcs1001? [23:26:30] ori: yes [23:26:47] you or me? [23:26:54] i'll do it [23:26:57] (03CR) 10MarcoAurelio: "Needs schema change (shorturl tables added to the project) before merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267783 (https://phabricator.wikimedia.org/T113348) (owner: 10MarcoAurelio) [23:27:00] thanks [23:27:43] (03CR) 10MarcoAurelio: "Maybe on this one shorturl tables should be removed from the project instead too." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267784 (owner: 10MarcoAurelio) [23:27:58] (03CR) 10MarcoAurelio: "Needs schema change (shorturl tables added to the project) before merge."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/268414 (https://phabricator.wikimedia.org/T125802) (owner: 10MarcoAurelio) [23:29:00] (03PS1) 10Aude: Set $wgArticlePlaceholderImageProperty for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268592 [23:29:10] !log rcs1001 - rebooting [23:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:08] (03PS10) 10BBlack: add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 [23:30:20] (03CR) 10BBlack: [C: 032 V: 032] add traffic-pool systemd service unit [puppet] - 10https://gerrit.wikimedia.org/r/268420 (owner: 10BBlack) [23:32:03] mafk, creating extension tables is simple, it's fine [23:32:11] (for most extensions) [23:32:30] Glad to hear [23:32:58] (03PS3) 10Krinkle: Removing testwiki from wmgUseShortUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267784 (owner: 10MarcoAurelio) [23:33:08] (03PS4) 10Krinkle: Remove testwiki from wmgUseShortUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267784 (owner: 10MarcoAurelio) [23:33:22] (03CR) 10Krinkle: "Grammer corrections." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267784 (owner: 10MarcoAurelio) [23:33:43] <_joe_> bblack: that patch is _awesome_ [23:33:53] <_joe_> and we _need_ to do the same for the appserver layer [23:34:16] _joe_: yeah I think it could be refactored more-generically, but that can come after. let's see if it works well in practice first :) [23:34:22] <_joe_> yes [23:34:36] <_joe_> ofc, you're my guinea pigs :) [23:34:59] Wouldn't it be better to have PyBal's health-checks be easily extensible, so they can check for something more meaningful than "something is listening on port 80"? [23:35:05] and depool/repool accordingly? 
[23:35:50] <_joe_> ori: the pybal checks have a lesser granularity [23:36:08] <_joe_> and the idea is, well, to make this action deterministic [23:36:35] (03PS1) 10Nuria: Increase length of lag window to 100 [puppet] - 10https://gerrit.wikimedia.org/r/268594 (https://phabricator.wikimedia.org/T125916) [23:36:40] also this mechanism gets it done ahead, instead of depooling when stuff's already failing (in the case of reboot) [23:36:44] <_joe_> of course you can serve a special url that you can modify in order to depool a single machine [23:36:51] <_joe_> that would be the same in fact [23:37:26] <_joe_> I proposed it, people didn't like the fact it's non-deterministic (i.e. depends on checks that are randomly scheduled) [23:38:01] <_joe_> and I tend to agree with that esp. for varnishes that serve a ridiculous amount of req/s [23:38:09] ori: after reboot.. ran puppet. Service[redis-instance-tcp_6379]/ensure: ensure changed 'stopped' to 'running'. service redis-instance-tcp_6379 status start/running .. but rcstream still is not [23:38:29] redis is not up [23:38:38] upstart is not tracking it correctly, still [23:38:53] PROBLEM - puppet last run on cp3004 is CRITICAL: Timeout while attempting connection [23:38:53] PROBLEM - puppet last run on cp3012 is CRITICAL: Timeout while attempting connection [23:38:53] PROBLEM - puppet last run on cp3007 is CRITICAL: Timeout while attempting connection [23:38:53] PROBLEM - puppet last run on cp3017 is CRITICAL: Timeout while attempting connection [23:38:53] PROBLEM - puppet last run on cp3005 is CRITICAL: Timeout while attempting connection [23:38:57] <_joe_> ori: I think I fixed it for the main redis cluster [23:39:01] <_joe_> WHOA [23:39:03] just got a http 500 [23:39:04] <_joe_> wtf? 
[23:39:10] <_joe_> bblack: ^^ [23:39:40] <_joe_> varnish is up on cp3004 [23:39:46] puppet is disabled everywhere [23:39:52] or should be [23:40:04] <_joe_> oh timeout is a strange message though [23:40:39] one or two got through on close timing (got the change applied), but not a lot [23:40:49] there shouldn't be 500s anyways, the service doesn't even get started by default, either [23:40:56] (on boot, but not on puppetize) [23:41:21] I got a single 500 and then it was fine again :/ [23:41:27] <_joe_> it's the second consecutive spike of 5xx I see [23:41:47] I really don't think it's related to traffic-pool [23:41:56] <_joe_> https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [23:42:02] <_joe_> I don't either [23:42:04] PROBLEM - traffic-pool service on cp1060 is CRITICAL: CRITICAL - Unit traffic-pool is active but reported exited [23:42:29] <_joe_> uhm [23:42:30] <_joe_> :) [23:42:34] ^ that's expected-ish, cp1060 is one I'm testing on, and not serving users [23:42:40] (regardless of pool state) [23:42:56] cp3015 will come up in a minute for that icinga check too, because it raced, but again this doesn't affect users [23:43:20] apparently the nrpe check is less than ideal for this scenario [23:43:27] <_joe_> why? [23:43:57] it doesn't think my service is active, because it "exited". 
but it is in fact active-but-exited (RemainAfterExit=true) [23:44:22] ● traffic-pool.service - Traffic Services Pool Control Loaded: loaded (/lib/systemd/system/traffic-pool.service; enabled) Active: active (exited) since Thu 2016-02-04 22:51:08 UTC; 53min ago [23:44:25] <_joe_> bblack: yeah the check we have in place is _not_ for oneshots [23:44:34] PROBLEM - traffic-pool service on cp3015 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [23:44:42] <_joe_> or maybe there is some logic in there [23:44:58] <_joe_> you should check if we added logic for the oneshots or not [23:45:02] yeah [23:45:05] <_joe_> I kinda remember something like that [23:45:18] <_joe_> but now it's time I go to bed I guess :) [23:45:22] also I guess "critical => false" doesn't do what I expected :P [23:45:38] <_joe_> it doesn't page us [23:45:40] nite! [23:45:41] ok [23:45:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:45:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:45:47] <_joe_> which is good [23:45:51] I figured that would downgrade to a warning or something [23:46:25] it just changes the contact group to have the "sms" contact for paging [23:47:03] <_joe_> it's like "critical, but for realz" [23:47:03] and if it sends email to alerts@ [23:47:13] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [23:47:28] <_joe_> cause you know, who cares about those mundane criticals that don't page...
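[The check fix landed in the 268596 patch below can be sketched as follows. A oneshot unit with RemainAfterExit=yes legitimately sits at ActiveState=active, SubState=exited, which a check expecting active+running misreads as a failure. `check_unit` is a hypothetical stand-in; in production the two state fields would come from `systemctl show -p ActiveState,SubState <unit>`, here they are passed in directly:]

```shell
check_unit() {
    unit=$1 active=$2 sub=$3
    case "$active/$sub" in
        active/running) echo "OK - $unit is active" ;;
        active/exited)  echo "OK - $unit is active (oneshot)" ;;  # the fix
        *)              echo "CRITICAL - $unit is $active/$sub"; return 2 ;;
    esac
}

check_unit traffic-pool active exited           # oneshot: now OK, not CRITICAL
check_unit redis-instance-tcp_6379 active running
```

[A stricter variant would accept active/exited only when the unit's `Type=` is actually oneshot (also available from `systemctl show`), so a crashed long-running daemon cannot masquerade as healthy.]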
[23:47:54] supercrit [23:48:51] (03PS3) 10Dzahn: RESTBase: Labs: Use relative module path [puppet] - 10https://gerrit.wikimedia.org/r/268587 (owner: 10Mobrovac) [23:50:52] (03CR) 10Dzahn: [C: 032] RESTBase: Labs: Use relative module path [puppet] - 10https://gerrit.wikimedia.org/r/268587 (owner: 10Mobrovac) [23:50:56] (03PS1) 10BBlack: check_systemd_unit_state: allow active+exited for oneshot [puppet] - 10https://gerrit.wikimedia.org/r/268596 [23:51:50] (03PS2) 10BBlack: check_systemd_unit_state: allow active+exited for oneshot [puppet] - 10https://gerrit.wikimedia.org/r/268596 [23:52:03] (03CR) 10Jdlrobson: [C: 031] Don't request pageprops for mobile search/nearby on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268208 (https://phabricator.wikimedia.org/T120197) (owner: 10Aude) [23:52:17] (03PS1) 10Mobrovac: RESTBase: Change to v0.10.x style config [puppet] - 10https://gerrit.wikimedia.org/r/268597 [23:52:24] mutante: ^^^ [23:52:35] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [23:52:41] mutante: to be on the safe side, i'll disable puppet in prod now so i can test in peace in staging [23:53:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:53:05] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:54:03] !log restbase disabled temporarily puppet in prod to test https://gerrit.wikimedia.org/r/#/c/268597/ [23:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:45] mutante: i disabled puppet in rb prod, so we are ok to proceed with https://gerrit.wikimedia.org/r/#/c/268597/ whenever you have time [23:58:38] (03CR) 10Dzahn: [C: 032] "same as in labs before and confirmed the base_path on cerium too (and puppet is disabled)" [puppet] - 10https://gerrit.wikimedia.org/r/268597 
(owner: 10Mobrovac) [23:59:08] mobrovac: ..and it's on the puppetmaster [23:59:53] thnx mutante!