[00:02:12] matt_flaschen, please test on testwiki
[00:02:28] it's sync'd there
[00:02:34] Krenair, will do, one sec.
[00:02:36] kk
[00:04:38] (PS1) Yuvipanda: ssh: Allow customizing authorized_keys_command [puppet] - https://gerrit.wikimedia.org/r/249030 (https://phabricator.wikimedia.org/T113979)
[00:05:01] (PS2) Yuvipanda: ssh: Allow customizing authorized_keys_command [puppet] - https://gerrit.wikimedia.org/r/249030 (https://phabricator.wikimedia.org/T113979)
[00:05:21] Krenair, hmm, not behaving as I expect at https://test.wikipedia.org/wiki/User_talk:Mattflaschen-WMF . I might be misunderstanding something. My understanding was this output should have data-parsoid, but it doesn't.
[00:05:24] ^ RoanKattouw
[00:05:49] Don't we strip data-parsoid in ContentFixer?
[00:06:04] One test is to type {{{foo}}} in the wikitext editor, switch to VE, then switch back to WT
[00:06:20] RoanKattouw, no, I don't think so. And if we did, we shouldn't have the ID replacement anyway.
[00:06:30] ugh, one moment
[00:06:32] If the second switch doesn't cause a 500 from the server, then something is right
[00:06:45] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[00:06:46] maybe my -i to sync-common was wrong
[00:06:54] try now
[00:07:31] RoanKattouw, I also did try that first, and probably broke https://test.wikipedia.org/wiki/Talk:Sandbox .
[00:07:44] (CR) Yuvipanda: [C: 2 V: 2] ssh: Allow customizing authorized_keys_command [puppet] - https://gerrit.wikimedia.org/r/249030 (https://phabricator.wikimedia.org/T113979) (owner: Yuvipanda)
[00:07:46] Oh, yeah, you don't have to save
[00:08:06] switching with {{{foo}}} WFM
[00:09:33] Krenair, yeah, looks right now.
[00:09:50] data-parsoid is now there (in fixed-html view).
[00:09:58] ok, that was my fault it didn't take effect
[00:10:59] It's okay.
[00:11:13] Now it's going to rest of the sites
[00:11:14] !log krenair@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/includes/Parsoid/Utils.php: https://gerrit.wikimedia.org/r/#/c/249026 (duration: 00m 18s)
[00:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:14:01] Thanks
[00:14:41] Everything good now?
[00:16:01] Krenair, well, I did verify it was deployed to other . Unfortunately, it didn't solve the particular issue we hoped it would solve (though it will still solve the other issues it was originally done for).
[00:16:50] alright
[00:33:32] (PS1) John F. Lewis: mailman: remove check for out queue [+data cron] [puppet] - https://gerrit.wikimedia.org/r/249037
[00:33:42] (PS2) John F. Lewis: mailman: remove check for out queue [+data cron] [puppet] - https://gerrit.wikimedia.org/r/249037
[00:33:53] mutante: ^^
[00:34:29] wait
[00:34:36] icinga check itself
[00:35:17] (PS3) John F. Lewis: mailman: remove check for out queue [+data cron] [puppet] - https://gerrit.wikimedia.org/r/249037
[00:35:26] (PS4) John F. Lewis: mailman: remove check for out queue [+data cron] [puppet] - https://gerrit.wikimedia.org/r/249037
[00:35:33] mutante: now you can merge :)
[00:39:06] JohnFLewis: does "25" make sense?
[00:39:31] mutante: we can evaluate that after :)
[00:39:33] (PS1) Dzahn: mariadb: 32 lint fixes [puppet] - https://gerrit.wikimedia.org/r/249038
[00:39:35] i agree about removing "out"
[00:39:41] one change, one topic
[00:39:47] fair
[00:39:47] but let me check the data we have and see
[00:40:46] (CR) Dzahn: [C: 2] "i agree the out queue doesn't add much value and we watched it, a spike here is a normal thing to occur, unlike for the other queues" [puppet] - https://gerrit.wikimedia.org/r/249037 (owner: John F. Lewis)
[00:41:29] fun; the virgin queue is permanently stuck on 1.
/me looks
[00:41:54] -c 72
[00:42:11] !log ori@tin Synchronized php-1.27.0-wmf.3/extensions/AbuseFilter: I2f84cff0: Avoid pointless range scan for 'load-recent-authors' (T116557) (duration: 00m 18s)
[00:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:44:20] (CR) Dzahn: "yep, removed cron job and data file manually" [puppet] - https://gerrit.wikimedia.org/r/249037 (owner: John F. Lewis)
[00:45:30] (PS2) Dzahn: mariadb: 32 lint fixes [puppet] - https://gerrit.wikimedia.org/r/249038
[00:50:42] operations, hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1756028 (RobH) I haven't gotten to this in time today. I may get some of it setup in advance of tomorrow, but likely I'll simply be picking this back up in the AM for completion.
[00:52:38] operations, ops-ulsfo: Move NTT @ ulsfo to a different cross-connect - https://phabricator.wikimedia.org/T112154#1756037 (RobH) Went onsite today, and moved the patch. There was no light, even after we rolled the fiber. I' contacted Kevin @ NTT (out of band email thread includes @mark & @faidon.) They...
[00:55:21] Going to hotfix deploy https://gerrit.wikimedia.org/r/#/c/249040/ in a minute unless someone else is deploying something?
[00:55:50] operations, TimedMediaHandler, hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1756048 (RobH) a:RobH I'll be taking three of those idle systems off T116256 as Reedy points out. Since they are idle, they won't be missed! There are four on t...
[00:56:18] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1756053 (RobH) a:RobH I'll investigate these to ensure there aren't any hw faults, and then I'll be taking a few for video scaling and return the fourth to service.
[00:59:07] (PS4) Chad: Move dsh code into scap where it belongs [puppet] - https://gerrit.wikimedia.org/r/247304
[00:59:12] operations, MediaWiki-Cache, Performance-Team, hardware-requests, Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1756057 (RobH) I meant to get to this today, but other tasks took priority. I'll invest...
[01:02:53] !log krinkle@tin Synchronized php-1.27.0-wmf.3/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: T116693 (duration: 00m 19s)
[01:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:06:40] operations, hardware-requests: Site: 1 server hardware access request for initializing the codfw elasticsearch cluster. - https://phabricator.wikimedia.org/T116236#1756065 (RobH) Open>stalled This has a run time of 8 days, and doesn't seem to be causing any undue concern at the moment. I'd like to...
[01:08:14] operations, hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1756069 (RobH) Is this blocked by the deployment of T114435 in terms of labs testing on bare metal, or does this still require a bare metal server allocated in eqiad for th...
[01:09:26] operations, Analytics-Backlog, Wikipedia-iOS-App-Product-Backlog, hardware-requests, iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1756072 (RobH) Would there be any drawback to running this inside a ganeti virtual machine, rather t...
[01:17:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0]
[01:18:26] operations, hardware-requests, Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1756081 (RobH) @Jcrespo: Can you advise what specs would be ideal for parsercache use? The initial task assumes machines similar to the Ciscos, but often those ciscos wer...
[01:20:45] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[01:29:59] operations, Analytics, Deployment-Systems, Services, Scap3: Use Scap3 for deploying AQS - https://phabricator.wikimedia.org/T114999#1756099 (Dzahn)
[01:30:02] operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1756098 (Dzahn)
[02:24:29] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 06s)
[02:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:08] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-27 02:29:08+00:00
[02:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:52:10] Puppet, Labs, Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1756160 (Negative24) Prod uses some variables from the private puppet repo for setting passwords and such. But the prod class also sets up mail servers and relays that would be complex to manage in l...
[03:05:05] (CR) Hydriz: [C: 1] "Will this be merged soon? It looks good and simple enough to be merged to me."
[puppet] - https://gerrit.wikimedia.org/r/235208 (owner: ArielGlenn)
[03:09:03] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: puppet fail
[03:10:46] (PS7) Chad: scap: Add co-master configuration [puppet] - https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis)
[03:10:48] (PS5) Chad: Move dsh code into scap where it belongs [puppet] - https://gerrit.wikimedia.org/r/247304
[03:19:11] (CR) Dzahn: [C: 1] "agree, looks good and simple enough, just +1 though because i don't know if it's waiting for an ok from legal or something. Ariel?" [puppet] - https://gerrit.wikimedia.org/r/235208 (owner: ArielGlenn)
[03:24:07] Puppet, Labs, Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1756194 (Dzahn) Ok, thanks for the explanation. I see.. I truly wish we could fix all this and really use the same role class in labs and prod. (and then apply changes in labs first for testing lik...
[03:25:26] Puppet, Labs, Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1756195 (Dzahn) The password part we can fix with the labs/private repo. Happy to help. Not so sure about the mail server part though.
[03:30:11] (PS1) Dzahn: maps: some puppet-lint fixes [puppet] - https://gerrit.wikimedia.org/r/249052
[03:31:10] (PS2) Dzahn: maps: some puppet-lint fixes [puppet] - https://gerrit.wikimedia.org/r/249052
[03:38:44] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:40:36] (PS1) Dzahn: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - https://gerrit.wikimedia.org/r/249056
[03:42:16] !log krinkle@tin Synchronized php-1.27.0-wmf.3/languages/Language.php: hotfix for T116693 (duration: 00m 19s)
[03:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:43:18] RoanKattouw_away: Deployed ^ , confirmed fix on zhwiki. Feel free to merge https://gerrit.wikimedia.org/r/#/c/249042/
[03:45:20] (PS2) Dzahn: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - https://gerrit.wikimedia.org/r/249056
[03:50:38] (PS3) Dzahn: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - https://gerrit.wikimedia.org/r/249056
[03:53:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 0 below the confidence bounds
[04:01:30] (PS3) Dzahn: maps: some puppet-lint fixes [puppet] - https://gerrit.wikimedia.org/r/249052
[04:01:32] (PS4) Dzahn: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - https://gerrit.wikimedia.org/r/249056
[04:01:34] (PS1) Dzahn: (WIP) maps: move roles into autoloader layout [puppet] - https://gerrit.wikimedia.org/r/249059
[04:08:34] (PS1) Dzahn: logstash: fix double quoted strings [puppet] - https://gerrit.wikimedia.org/r/249060
[04:14:28] (CR) BryanDavis: [C: -1] "I'm pretty sure that role::logstash::eventlogging is in active use in both production and beta cluster."
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/249060 (owner: Dzahn)
[04:17:05] (PS1) Dzahn: logstash: move files from ./files to module [puppet] - https://gerrit.wikimedia.org/r/249062
[04:18:31] (CR) Dzahn: "..and i had no intention to remove that. sorry, rebase fail." [puppet] - https://gerrit.wikimedia.org/r/249060 (owner: Dzahn)
[04:21:03] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0]
[04:22:46] (PS2) Dzahn: logstash: fix double quoted strings [puppet] - https://gerrit.wikimedia.org/r/249060
[04:24:50] (PS3) Dzahn: logstash: fix double quoted strings & alignments [puppet] - https://gerrit.wikimedia.org/r/249060
[04:26:58] (PS4) Dzahn: logstash: fix double quoted strings & alignments [puppet] - https://gerrit.wikimedia.org/r/249060
[04:29:47] (PS1) Dzahn: rm files/misc/apt-security-updates [puppet] - https://gerrit.wikimedia.org/r/249063
[04:29:54] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:30:10] operations, ops-codfw, Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1756236 (Papaul) I had the switch number position wrong so changing it ms-be2016 10.193.1.12 port xe-2/0/7 ms-be2017 10.193.1.13 port xe-7/0/7 ms-be2018 10.193.1.14 port xe-2/0/7 ms...
[04:33:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 18.18% of data above the critical threshold [500.0]
[04:35:27] (CR) BryanDavis: [C: 1] "Not tested but it looks sane" [puppet] - https://gerrit.wikimedia.org/r/249060 (owner: Dzahn)
[04:36:13] (PS2) Dzahn: rm files/misc/apt-security-updates [puppet] - https://gerrit.wikimedia.org/r/249063
[04:38:14] (CR) BryanDavis: "These files really go with the role and not the module.
Maybe the role should be moved to the role module instead and take the files with " [puppet] - https://gerrit.wikimedia.org/r/249062 (owner: Dzahn)
[04:40:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 9 below the confidence bounds
[04:42:04] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:45:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 9 below the confidence bounds
[04:50:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 9 below the confidence bounds
[04:55:54] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[04:55:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 9 below the confidence bounds
[05:12:18] bd808: Are there phabricator tasks for EL errors?
[05:12:28] If not, I'll create some. There are several obvious trending ones
[05:12:33] That's just data being lost
[05:13:20] I think I filed one against mobilefrontend at some point...
[05:13:28] but only that one
[05:20:16] (PS2) ArielGlenn: add apps development guidelines to legal text for dumps [puppet] - https://gerrit.wikimedia.org/r/235208
[05:21:51] (CR) ArielGlenn: [C: 2] "I tried and failed to get them to comment on this specific change. (See T110742) Giving up and merging it."
[puppet] - https://gerrit.wikimedia.org/r/235208 (owner: ArielGlenn)
[05:23:44] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:25:57] (PS8) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[05:36:36] (PS1) ArielGlenn: add apps guidelines to legal info page on datasets [puppet] - https://gerrit.wikimedia.org/r/249068
[05:37:27] (CR) ArielGlenn: [C: 2] add apps guidelines to legal info page on datasets [puppet] - https://gerrit.wikimedia.org/r/249068 (owner: ArielGlenn)
[05:40:54] "they cannot use the Ansible playbook the Services Team is using for deployment."
[05:41:00] services is using Ansible?
[05:41:16] yes :/
[05:41:18] operations, Datasets-General-or-Unknown, Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1756313 (ArielGlenn) Open>Resolved All right, I've added this change on the page of just legal text too, I suppose that can't hurt anything. Since Legal signs o...
[05:42:07] Krinkle: I think they are testing out scap3 as a replacement though
[05:42:11] not sure what they are using it for, but i sure wish i could script together a rolling restart of the elasticsearch cluster over an ssh connection (like ansible makes possible)...
[05:42:34] (i wrote one with fabfile, but it can't talk to our prod cluster)
[05:42:38] I assume it's not used for any kind of provisioning, so more like python fab, not like puppet.
[05:42:56] eventhough ansible is also often used as puppet replacement
[05:43:04] (jQuery ops is switching from puppet to ansible as we speak)
[05:43:30] they are using it instead of trebuchet
[05:43:45] push code to nodes and script the restarts needed
[05:44:14] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:03] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[05:47:02] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Oct 27 05:47:02 UTC 2015 (duration 47m 1s)
[05:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:50:56] ebernhardson: why can't fabfile talk to prod? we use it for CI
[05:53:35] legoktm: perhaps its been updated since then, last i tried the python implementation of SSH couldn't talk to our servers due to https://github.com/paramiko/paramiko/pull/356
[05:53:52] hmm, there is an older ticket than that
[05:54:19] well, that one is about 16 months old, might be the one
[05:54:32] ebernhardson: somehow I'm able to talk to gallium using fab.
[05:55:13] legoktm: yeah, but gallium has a public IP, not proxy
[05:55:40] could be a red herring, but if it has issues with honouring ProxyCommand, that would be the reason why it works for gallium
[05:56:03] Though it seems unlikely
[05:56:48] Krinkle: I don't think so, because it works when I use zuul.eqiad.wmnet which goes through ProxyCommand.
[05:59:32] hmm, its been so long since i've run this i'm not even sure what command line i need, its not reading my ~/.ssh/config at all, although env.use_ssh_config is set
[05:59:51] (or at least, i assume so since it cant find the domain name for elastic1001.eqiad.wmnet)
[06:00:04] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:04] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:04] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:04] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:04] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:05] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:53] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[06:01:54] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:01:54] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[06:02:34] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[06:02:49] legoktm: what command line are you using to tell fab where to talk to and what ssh config to use?
[06:03:16] ebernhardson: https://phabricator.wikimedia.org/diffusion/CICF/browse/master/fabfile.py
[06:03:41] I run `fab deploy_zuul`
[06:05:23] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy
[06:05:24] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[06:05:24] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[06:06:32] legoktm: just ran it w/strace -e trace=network, it connected to 208.80.154.135 which doesn't look to be a bastion
[06:10:16] actually it never connected, hmm
[06:11:09] o.O
[06:14:14] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:14] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:14] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:14] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:14] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:15] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:19] (CR) ArielGlenn: "how many days of logs do you want to keep? It's not specified in the logrotate conf." [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson)
[06:15:53] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[06:15:53] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[06:15:53] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[06:15:54] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:15:54] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[06:15:55] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[06:16:04] legoktm: trying to talk to prod cluster i basically get this: https://phabricator.wikimedia.org/P2236
[06:16:11] tl/dr: SSHException: Incompatible ssh peer (no acceptable kex algorithm)
[06:16:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds
[06:17:20] which makes sense, as the bastion only provided two [u'curve25519-sha256@libssh.org', u'diffie-hellman-group-exchange-sha256'] and both require sha256 which is unimplemented
[06:17:37] i wonder why yours works :S
[06:19:57] the reason is because gallium (and lanthanum, and antimony) specifically turn off the ssh protections
[06:20:29] ...
[06:20:30] also nova controllers, the labs nfs fileserver, some integration hosts, and deployment-prep
[06:20:32] lol
[06:20:46] grep for 'dsiable_nist_kex' in hieradata basically :)
[06:20:49] disable_nist_kex
[06:21:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds
[06:29:02] (CR) EBernhardson: "oops let me add that.
As for how long: enwiki takes about an hour and is significantly larger than all the others (see the ticket for size" [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: EBernhardson)
[06:30:33] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:38] (PS9) EBernhardson: Generate weekly cirrussearch dumps [puppet] - https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690)
[06:30:43] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:44] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[06:30:44] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:54] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:44] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:14] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:15] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:45] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:35:33] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:35:34] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
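The `no acceptable kex algorithm` failure discussed above (06:16 to 06:20) comes down to an empty overlap between the key-exchange algorithms the client offers and the ones the server accepts. A minimal Python sketch of that selection rule follows. The server list is the one quoted in the log; the client list is an assumption standing in for an older SHA-1-only client, and `pick_kex` is a hypothetical helper, not a paramiko API. Real SSH negotiation (RFC 4253, section 7.1) also considers host key algorithm compatibility, which this sketch ignores.

```python
def pick_kex(client_algos, server_algos):
    """Return the first client-preferred algorithm the server also offers, or None."""
    offered = set(server_algos)
    for name in client_algos:
        if name in offered:
            return name
    return None


# Offered by the bastion, as quoted in the log.
BASTION_KEX = [
    "curve25519-sha256@libssh.org",
    "diffie-hellman-group-exchange-sha256",
]

# Assumption: an illustrative SHA-1-only client list, not taken from the log.
OLD_CLIENT_KEX = [
    "diffie-hellman-group-exchange-sha1",
    "diffie-hellman-group14-sha1",
    "diffie-hellman-group1-sha1",
]

if __name__ == "__main__":
    chosen = pick_kex(OLD_CLIENT_KEX, BASTION_KEX)
    print(chosen if chosen else "no acceptable kex algorithm")
```

Hosts with `disable_nist_kex` set presumably accept additional algorithms, which would make the overlap non-empty and explain why `fab` works against gallium.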
[06:36:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 0 below the confidence bounds
[06:37:14] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:37:14] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[06:37:15] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[06:42:34] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:34] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:34] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:34] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 0 below the confidence bounds
[06:44:15] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[06:44:15] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:44:15] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[06:46:04] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[06:50:19] <_joe_> uh what the heck happened to restbase?
[06:51:23] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:51:23] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:51:52] <_joe_> oook restbase is out, apparently
[06:51:57] <_joe_> lemme recheck
[06:53:03] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[06:53:04] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[06:55:03] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:55:05] <_joe_> great, logstash is down
[06:55:08] <_joe_> at least kibana
[06:55:14] <_joe_> and restbase has no local logs
[06:55:35] <_joe_> but please, don't listen to this old fart when he says we should also log locally....
[06:55:58] i said so too and i'm not old (yet)
[06:56:35] <_joe_> tut tut, you're implicitly old
[06:57:04] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:04] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:57:07] <_joe_> ori: regarding sniffly, it's a well known leaking vector
[06:57:14] <_joe_> (HSTS)
[06:57:24] <_joe_> no one ever bothered to implement a demo though
[06:58:03] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] <_joe_> ok logstash isn't actually down, I'm just waiting for it since 2 minutes
[06:58:24] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:57] https://tools.ietf.org/html/rfc6797#section-14.9
[07:00:02] ^ kinda hints at those things
[07:00:15] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:15] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:23] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:55] CSP makes it easy though :)
[07:01:50] <_joe_> and btw, this seems to be an NRPE issue
[07:02:01] <_joe_> as on rb1002 I just ran the check manually and it's fine
[07:02:03] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy
[07:02:03] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy
[07:02:04] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[07:02:04] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[07:03:46] bblack: that was re: https://zyan.scripts.mit.edu/sniffly/ btw (warning, link is to proof-of-concept page that tries to figure out your browser history)
[07:04:05] i am coming around to the conclusion that logstash is just not very good software
[07:05:15] it mixes software components from different ecosystems without due regard for fit and cohesion
[07:05:15] <_joe_> ori: oh I did that about 4 years ago :P
[07:05:31] <_joe_> ori: "but install the next version"
[07:05:35] redis tries to be UNIX, elastic is JVM, the ingress is ruby
[07:05:38] right
[07:05:46] and throw infinite hardware resources at it
[07:05:55] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0]
[07:06:23] <_joe_> ori: yup
[07:06:27] it's slow, it sucks up system resources like a vacuum
[07:06:33] <_joe_> ori: well, we could buy splunk instead
[07:06:34] on both back-end and front-end, which is quite an achievement
[07:06:53] <_joe_> it has the same flaws, but it gives you that cozy enterprise feeling
[07:07:23] <_joe_> also it works better when backed by a million-dollars SAN
[07:07:43] i think the way to go is text files and some investment in a small handful of tasteful scripts
[07:08:14] <_joe_> we should surely try to make logstash better, as a tool it has some value
[07:08:28] <_joe_> I just don't see "logstash only logging" as a viable idea
[07:08:39] yeah, i agree with both points
[07:08:40] <_joe_> at least it makes *my* work harder
[07:09:04] the biggest value probably comes from the fact that there is a class of developers who never looked at fluorine who look at the logs
[07:09:18] <_joe_> ori: also, you can see global trends
[07:10:16] <_joe_> but think if mediawiki didn't log php errors to the standard error.log, when we had the logging-induced mw outage...
[07:10:29] <_joe_> it would've taken us hours to fix that
[07:10:40] _joe_: re: global trends, pshhhhh
[07:10:41] https://dpaste.de/Q3HC/raw
[07:11:17] cf "some investment in a small handful of tasteful scripts" :)
[07:13:43] _joe_: also check out https://asciinema.org/a/49azixwi3cc0sn2qlygmt8bpk
[07:15:36] <_joe_> wow that is neat
[07:15:51] <_joe_> ori: can you send an email to ops@ when you create these tools?
[07:16:03] <_joe_> maybe not xenon-grep specifically
[07:16:24] <_joe_> but log-trends can be very handy
[07:16:57] yeah i have a task open for moving it out of my dotfiles
[07:17:31] thanks for checking it out! :)
[07:21:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 0 below the confidence bounds
[07:23:04] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:04] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
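As a rough illustration of the "text files and a small handful of tasteful scripts" idea raised above, here is a minimal Python sketch that counts plain-text log lines per normalized message so that trending errors stand out. This is not the `log-trends` tool mentioned in the conversation, whose contents are not shown in this log; the function name `top_messages` and the normalization rule (runs of digits replaced by `N`) are assumptions made for the example.

```python
import re
from collections import Counter


def top_messages(lines, limit=5):
    """Return the most frequent log messages after collapsing digit runs to 'N'."""
    counts = Counter()
    for line in lines:
        msg = line.strip()
        if not msg:
            continue  # skip blank lines
        # Collapse numbers so "after 10 seconds" and "after 12 seconds" group together.
        counts[re.sub(r"\d+", "N", msg)] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = [
        "timeout after 10 seconds",
        "timeout after 12 seconds",
        "",
        "disk full",
    ]
    for message, count in top_messages(sample):
        print(count, message)
```

A script of this kind only needs read access to the local log file, which is the property being argued for when the central log UI is slow or unavailable.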
[07:25:35] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: puppet fail [07:26:35] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:27:03] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:27:23] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [07:27:33] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:27:34] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:28:13] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:28:15] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:24] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [07:28:24] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:24] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:28:24] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:01] <_joe_> I'm not exactly sure what's going on with these restbase alerts, I ran the script repeatedly on the machines and it's fine [07:30:27] <_joe_> I see some increased parsoid latencies [07:30:34] <_joe_> but that seems to be it [07:31:55] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [07:32:03] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [07:33:45] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:33:45] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:33:45] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:34:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [07:35:44] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [07:35:47] <_joe_> on restbase1002: Tue Oct 27 07:33:34 UTC 2015 All endpoints are healthy [07:36:36] <_joe_> so yeah, I'm not sure what is going on there and I have to run some errands, seems like something is failing, either nrpe or nagios itself [07:37:15] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:37:15] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:39:03] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [07:39:03] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [07:39:03] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [07:40:54] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail [07:42:25] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [07:42:25] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [07:42:33] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [07:42:33] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [07:42:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [07:49:33] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:49:34] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:49:34] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:51:13] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [07:51:14] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [07:51:15] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [07:51:54] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [07:53:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [08:00:15] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [08:02:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [08:02:07] 6operations, 6Analytics-Backlog, 10Datasets-General-or-Unknown: Requests to dumps.wikimedia.org should end up in hadoop wmf.webrequest via kafka! - https://phabricator.wikimedia.org/T116430#1756432 (10Addshore) 5duplicate>3Open Re opened per https://phabricator.wikimedia.org/T116429#1754706 [08:07:04] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1756437 (10MartinK) Imho for volunteers like us the key benefit of OTRS Version 5 is the mobile ready user interface. Being able to Prozess some Tickets while commuting would realy increase our producti... [08:07:13] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:07:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [08:08:54] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:10:34] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [08:14:06] (03CR) 10Addshore: [C: 04-1] "It may be nicer to simply have a daily.* here so that others can also use this easily." [puppet] - 10https://gerrit.wikimedia.org/r/247866 (owner: 10Addshore) [08:14:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [08:28:24] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:32:44] (03CR) 10ArielGlenn: "ok great, doing a test run of the script now on snapshot1003 (the cron jobs host). is there a ticket for this btw? I hunted around a bit " [puppet] - 10https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: 10EBernhardson) [08:33:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Indeed we want to kill ./files but what this patch does is moving files not referenced by a module into a module. 
The files are referenced" [puppet] - 10https://gerrit.wikimedia.org/r/249062 (owner: 10Dzahn) [08:34:07] (03CR) 10Alexandros Kosiaris: "Sigh, I just realized I echoed Bryan :-)" [puppet] - 10https://gerrit.wikimedia.org/r/249062 (owner: 10Dzahn) [08:36:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "file is referenced by role classes, should not be in the postgresql module class" [puppet] - 10https://gerrit.wikimedia.org/r/249056 (owner: 10Dzahn) [08:36:39] (03CR) 10Alexandros Kosiaris: [C: 032] maps: some puppet-lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249052 (owner: 10Dzahn) [08:37:00] (03PS4) 10Alexandros Kosiaris: maps: some puppet-lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249052 (owner: 10Dzahn) [08:40:53] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [08:46:58] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1756475 (10Deskana) [08:59:18] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1756491 (10akosiaris) Change merged and tested. Resolving >>! In T115067#1755730, @Dzah... [09:00:05] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151027T0900). 
[09:01:40] (03CR) 10Alexandros Kosiaris: [V: 032] maps: some puppet-lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249052 (owner: 10Dzahn) [09:01:44] (03PS1) 10ArielGlenn: dumps: fix up incrementals scripts to use changed WikiDump names [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/249074 [09:01:46] (03PS1) 10ArielGlenn: adds-changes: toss a few redundant classes and import from dumps lib [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/249075 [09:05:55] 6operations, 6Phabricator, 6Security: Phabricator dependence on wmfusercontent.org - https://phabricator.wikimedia.org/T104730#1756499 (10Bawolff) p:5High>3Low [09:08:26] !log aude@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: Add forceParse UpdaterFlag and option in forceSearchIndex script (duration: 00m 19s) [09:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:08:56] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1756500 (10Joe) @bd808 I will work on the wikitech settings right away, and I'll take a look at the other files as well. [09:09:13] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:09:13] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:10:54] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [09:10:54] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [09:17:16] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: fix camelcases in WikiDumps.py (part 1) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/248866 (owner: 10ArielGlenn) [09:17:24] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1756503 (10Joe) And btw yes - I think we could anyways move to use mira for the time being instead of tin. [09:17:40] (03PS10) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [09:18:03] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:18:03] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:18:03] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:18:03] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:18:03] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:18:04] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:18:04] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:18:37] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: camelcases in wikiDumps.py (part 2) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/248867 (owner: 10ArielGlenn) [09:19:43] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [09:19:43] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [09:19:44] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [09:19:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [09:19:44] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [09:19:44] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [09:19:45] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [09:20:38] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: fix up incrementals scripts to use changed WikiDump names [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/249074 (owner: 10ArielGlenn) [09:21:34] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [09:24:09] (03PS1) 10Giuseppe Lavagetto: wikitech: make private settings file writable by owner [puppet] - 10https://gerrit.wikimedia.org/r/249076 (https://phabricator.wikimedia.org/T87036) [09:25:05] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [09:26:54] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:26:54] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:28:06] (03CR) 10Gilles: Made the session/main stashes write to both DCs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [09:30:43] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:43] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:45] (03CR) 10Gilles: [C: 031] varnish: add prototype cookie-based backend selection [puppet] - 10https://gerrit.wikimedia.org/r/247970 (https://phabricator.wikimedia.org/T91820) (owner: 10Ori.livneh) [09:32:05] (03CR) 10Mobrovac: [C: 031] cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/248960 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [09:32:33] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [09:32:33] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [09:32:33] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [09:32:37] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM, 5Patch-For-Review: Reimage mw1152 as a terbium replacement - https://phabricator.wikimedia.org/T116728#1756538 (10Joe) 3NEW a:3Joe [09:37:42] (03PS2) 10Giuseppe Lavagetto: wikitech: make private settings file writable by owner [puppet] - 10https://gerrit.wikimedia.org/r/249076 (https://phabricator.wikimedia.org/T87036) [09:38:03] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:03] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:39:44] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [09:39:44] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [09:39:45] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [09:41:37] (03PS2) 10ArielGlenn: adds-changes: toss a few redundant classes and import from dumps lib [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/249075 [09:44:06] (03CR) 10ArielGlenn: [C: 032 V: 032] adds-changes: toss a few redundant classes and import from dumps lib [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/249075 (owner: 10ArielGlenn) [09:44:41] (03PS1) 10Filippo Giunchedi: cassandra: switch to monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/249082 [09:45:05] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:45:13] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:45:13] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:45:56] (03PS2) 10Filippo Giunchedi: cassandra: switch to monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/249082 [09:47:05] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1756580 (10jcrespo) @RobH, [[ https://grafana.wikimedia.org/dashboard/db/server-board?from=1445119200000&to=1445723999999&var-server=pc*&var-network=eth0 | parsercaches ]]... 
[09:47:34] 6operations, 10Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1756584 (10ArielGlenn) [09:47:35] 6operations, 10Continuous-Integration-Config, 10Dumps-Generation, 5Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#1756583 (10ArielGlenn) [09:48:44] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [09:48:44] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [09:48:44] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [09:49:07] 6operations, 10Dumps-Generation: make dumps easy to rerun or clean up - https://phabricator.wikimedia.org/T110876#1588735 (10ArielGlenn) [09:50:45] 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1756595 (10akosiaris) 5Open>3stalled After some IRC talk with @mobrovac this is currently deployed with Ansible still. There is a blocking task T114999 to migrate this to scap3. As already pointed... [09:54:14] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:54:59] !log convert restbase-test2003 to cassandra multi-instance [09:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:55:52] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1756609 (10ArielGlenn) work plan looks like this: make sure jessie install looks good add salt master role, copy over all minion keys add master manually as secondary to one client, restart it... [09:57:53] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:57:54] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:57:54] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:57:54] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:57:54] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] "comments inline, need to figure out how to make this work. The issue present is a blocker for other migrations to the role module, e.g. et" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [09:59:34] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [09:59:44] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [09:59:47] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1756610 (10akosiaris) I am not getting anything useful out of https://tools.wmflabs.org/sal/production?p=0&q=conf-svn&d= either... [10:01:35] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [10:01:35] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [10:01:43] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [10:01:43] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [10:03:50] (03CR) 10Alexandros Kosiaris: [C: 031] wikitech: make private settings file writable by owner [puppet] - 10https://gerrit.wikimedia.org/r/249076 (https://phabricator.wikimedia.org/T87036) (owner: 10Giuseppe Lavagetto) [10:07:13] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:07:13] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:07:13] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:07:13] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:07:14] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:07:14] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:07:14] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:09:03] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [10:09:03] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [10:09:03] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [10:09:03] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [10:09:04] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [10:11:46] !log disabling puppet and restarting mysql servers at db1069- this will create a small amount of lag on labs [10:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:12:51] (03PS11) 10Alexandros Kosiaris: etherpad: Move role into module [puppet] - 10https://gerrit.wikimedia.org/r/220085 [10:14:33] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:14:34] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:14:34] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:14:34] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:18:04] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [10:18:04] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [10:18:04] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [10:18:05] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [10:18:05] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [10:18:13] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [10:21:46] (03PS2) 10ArielGlenn: dumps: admin script to do cleanup, enter maintenance mode, etc [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/234971 [10:23:43] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:23:44] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:23:44] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:23:44] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:23:44] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:23:44] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:23:44] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:27:23] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [10:27:23] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [10:31:13] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:33:04] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:36:43] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [10:36:43] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [10:36:43] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [10:36:43] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [10:36:44] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [10:36:44] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [10:42:14] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:42:14] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:42:14] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:42:14] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [10:42:14] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:42:14] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:44:03] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:44] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:47:44] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:49:33] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [10:49:33] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [10:49:33] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [10:49:33] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [10:49:34] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [10:49:50] (03PS1) 10Giuseppe Lavagetto: role::deployment: move to role module [puppet] - 10https://gerrit.wikimedia.org/r/249090 [10:49:52] (03PS1) 10Giuseppe Lavagetto: role::deployment::server: reorganize code [puppet] - 10https://gerrit.wikimedia.org/r/249091 [10:49:54] (03PS1) 10Giuseppe Lavagetto: role::deployment: move things to deployment::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/249092 [10:49:56] (03PS1) 10Giuseppe Lavagetto: deployment::mediawiki: rename wikitech::wiki::password class [puppet] - 10https://gerrit.wikimedia.org/r/249093 [10:50:15] !log stopping Jenkins due to an unclean state [10:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:51:54] apergos: what about the rest of the patches? [10:52:09] paravoid: haven't forgotten you, still in progress [10:52:25] k [10:52:52] I saw you reviewed half of them and they're really not that many, so I thought maybe you didn't see the rest [10:54:22] (03PS1) 10Giuseppe Lavagetto: role::deployment: remove test role [puppet] - 10https://gerrit.wikimedia.org/r/249094 [10:55:04] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:04] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:04] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:55:04] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:04] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:02] !log Jenkins job https://integration.wikimedia.org/ci/job/operations-puppet-doc/ is broken. I am on it :-( [10:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:57:45] !log downtime restbase endpoints health for restbase1* while investigating [10:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:13] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [11:11:44] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [11:11:45] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [11:15:23] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [11:15:23] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [11:15:23] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [11:15:23] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [11:15:23] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [11:18:44] PROBLEM - Cassandra CQL query interface on restbase1007 is CRITICAL: Connection refused [11:19:15] PROBLEM - Cassandra database on restbase1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [11:20:49] 6operations, 10RESTBase-Cassandra: restbase endpoints health checks timing out - https://phabricator.wikimedia.org/T116739#1756819 (10fgiunchedi) 3NEW [11:20:53] (03PS1) 10Alexandros Kosiaris: maps: Tune replication parameters [puppet] - 10https://gerrit.wikimedia.org/r/249096 
(https://phabricator.wikimedia.org/T116553) [11:21:36] mobrovac: going somewhere with https://phabricator.wikimedia.org/T116739 perhaps, thoughts? [11:23:43] !log cassandra OOM'd on restbase1007, restarting [11:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:24:44] RECOVERY - Cassandra database on restbase1007 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [11:26:04] RECOVERY - Cassandra CQL query interface on restbase1007 is OK: TCP OK - 0.002 second response time on port 9042 [11:27:52] godog: hopefully this will fix it, most of the RB errors were cass time outs [11:33:58] mobrovac: I'd be surprised and worried if a single instance oom'ing would affect it tbh [11:34:16] (03PS1) 10Joal: Update camus runs [puppet] - 10https://gerrit.wikimedia.org/r/249100 (https://phabricator.wikimedia.org/T113252) [11:34:46] godog: could be "the wrong" instance to go south :p [11:35:08] mobrovac: still, 3.2s for mobile-html is expected? [11:35:21] or 2.5s, you get the idea [11:36:19] that is indeed strange [11:36:54] the mobileapps service doesn't seem to be busy on scb100x [11:45:25] PROBLEM - Cassandra CQL query interface on restbase-test2003 is CRITICAL: Connection refused [11:45:46] godog: that's you ^^ ? [11:47:05] (03PS1) 10Filippo Giunchedi: cassandra: add restbase-test2003 instances [puppet] - 10https://gerrit.wikimedia.org/r/249101 [11:47:17] (03CR) 10Muehlenhoff: [C: 031] "Seems indeed like cruft, the log directory doesn't exist anywhere in the cluster." 
[puppet] - 10https://gerrit.wikimedia.org/r/249063 (owner: 10Dzahn) [11:47:23] (03PS1) 10Giuseppe Lavagetto: role::deployment::server: drop mod_dav [puppet] - 10https://gerrit.wikimedia.org/r/249102 [11:47:23] mobrovac: yup, silencing now [11:48:05] <_joe_> Reedy: ^^ [11:52:10] (03CR) 10Reedy: [C: 031] role::deployment::server: drop mod_dav [puppet] - 10https://gerrit.wikimedia.org/r/249102 (owner: 10Giuseppe Lavagetto) [11:52:22] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase-test2003 instances [puppet] - 10https://gerrit.wikimedia.org/r/249101 (owner: 10Filippo Giunchedi) [11:56:24] !log reimage restbase-test2003 [11:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:59:09] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Tune replication parameters [puppet] - 10https://gerrit.wikimedia.org/r/249096 (https://phabricator.wikimedia.org/T116553) (owner: 10Alexandros Kosiaris) [11:59:14] (03PS2) 10Alexandros Kosiaris: maps: Tune replication parameters [puppet] - 10https://gerrit.wikimedia.org/r/249096 (https://phabricator.wikimedia.org/T116553) [11:59:18] (03PS3) 10Alex Monk: beta: Add enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248639 [11:59:20] (03CR) 10Alexandros Kosiaris: [V: 032] maps: Tune replication parameters [puppet] - 10https://gerrit.wikimedia.org/r/249096 (https://phabricator.wikimedia.org/T116553) (owner: 10Alexandros Kosiaris) [12:05:08] 6operations: Track amount of package updates on systems - https://phabricator.wikimedia.org/T116742#1756923 (10MoritzMuehlenhoff) 3NEW [12:11:52] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [12:12:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [12:16:08] (03PS2) 10Muehlenhoff: Add missing Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/248865 [12:17:52] 
PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [12:18:30] (03PS1) 10Jcrespo: Replicate pt-heartbeat table to labs. Stop replicating msg_resource [puppet] - 10https://gerrit.wikimedia.org/r/249105 (https://phabricator.wikimedia.org/T116720) [12:21:38] !log Just dropped msg_resource tables from labs dbs. Filters modified to stop replicationg them. Started replicating the heartbeat tables. [12:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:51] (03PS2) 10Jcrespo: Replicate pt-heartbeat table to labs. Stop replicating msg_resource [puppet] - 10https://gerrit.wikimedia.org/r/249105 (https://phabricator.wikimedia.org/T116720) [12:23:01] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:23:08] <_joe_> win 17 [12:23:24] win! [12:23:31] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [12:23:35] (03CR) 10Jcrespo: [C: 032] Replicate pt-heartbeat table to labs. Stop replicating msg_resource [puppet] - 10https://gerrit.wikimedia.org/r/249105 (https://phabricator.wikimedia.org/T116720) (owner: 10Jcrespo) [12:24:09] win? [12:29:05] something is wrong with heartbeat- only shard7 updates the table [12:33:01] will check it later, it is a new feature, not a bug [12:34:42] (03PS3) 10Muehlenhoff: Add missing Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/248865 [12:34:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add missing Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/248865 (owner: 10Muehlenhoff) [12:40:02] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [12:41:04] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[12:44:22] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 5.08% of data above the critical threshold [1000.0] [12:47:36] morebots: that you I guess? [12:47:36] I am a logbot running on tools-exec-1203. [12:47:36] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [12:47:36] To log a message, type !log . [12:47:38] er [12:47:48] moritzm: that you I guess? [12:47:57] the unmerged changes [12:48:51] I withheld my merge since there was one from jynus around (but he's AFK) [12:50:12] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [12:50:12] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [12:50:13] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [12:52:05] ^^^ that's due to mysql package upgrade post-inst stupidity. fixing. [12:52:38] (03PS1) 10Dereckson: Throttle rule for Wikisource editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249107 (https://phabricator.wikimedia.org/T116745) [12:52:52] I poked him earlier on IRC, but I'd rather avoid merging that when he's not around [12:53:24] chasemp: we have an emergency request for a throttle rule, for a right now wikisource editathon with 45 participants, so it's tedious to create accounts on wiki, would it be possible to deploy https://gerrit.wikimedia.org/r/#/c/249107/ ? 
[12:55:12] RECOVERY - check_mysql on payments1002 is OK: Uptime: 386 Threads: 1 Questions: 4214 Slow queries: 55 Opens: 470 Flush tables: 1 Open tables: 63 Queries per second avg: 10.917 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [12:55:12] RECOVERY - check_mysql on payments1003 is OK: Uptime: 359 Threads: 1 Questions: 4322 Slow queries: 59 Opens: 434 Flush tables: 1 Open tables: 63 Queries per second avg: 12.038 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [12:55:13] RECOVERY - check_mysql on payments1004 is OK: Uptime: 369 Threads: 2 Questions: 1034 Slow queries: 28 Opens: 433 Flush tables: 1 Open tables: 64 Queries per second avg: 2.802 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [12:58:59] !log disabling puppet and bringing down OTRS service on mendelevium [12:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:13] hmm also I should schedule downtime [12:59:27] (03CR) 10Alex Monk: [C: 032] Throttle rule for Wikisource editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249107 (https://phabricator.wikimedia.org/T116745) (owner: 10Dereckson) [12:59:36] (03Merged) 10jenkins-bot: Throttle rule for Wikisource editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249107 (https://phabricator.wikimedia.org/T116745) (owner: 10Dereckson) [13:00:29] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/249107/ (duration: 00m 18s) [13:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:39] Dereckson, done ^ [13:00:42] Thanks [13:03:56] Krenair: M0tty says thank you a lot. 
[13:05:40] no problem [13:07:41] 6operations: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1757061 (10MoritzMuehlenhoff) 3NEW [13:08:12] 6operations: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1757069 (10MoritzMuehlenhoff) [13:10:20] (03PS1) 10Giuseppe Lavagetto: r::mw::maintenance: include role::mediawiki::common [puppet] - 10https://gerrit.wikimedia.org/r/249108 (https://phabricator.wikimedia.org/T116728) [13:10:22] (03PS1) 10Giuseppe Lavagetto: mediawiki: group general monitoring scripts in a single role [puppet] - 10https://gerrit.wikimedia.org/r/249109 [13:10:24] (03PS1) 10Giuseppe Lavagetto: mw1152: convert to be the HAT maintenance host [puppet] - 10https://gerrit.wikimedia.org/r/249110 (https://phabricator.wikimedia.org/T116728) [13:11:46] (03CR) 10Florianschmidtwelzow: [C: 031] "+1!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248639 (owner: 10Alex Monk) [13:12:54] (03PS1) 10Faidon Liambotis: Revert "Replicate pt-heartbeat table to labs. Stop replicating msg_resource" [puppet] - 10https://gerrit.wikimedia.org/r/249111 [13:13:06] (03PS2) 10Faidon Liambotis: Revert "Replicate pt-heartbeat table to labs. Stop replicating msg_resource" [puppet] - 10https://gerrit.wikimedia.org/r/249111 [13:13:29] hmmm [13:13:38] this is labs, so this was probably already propagated into labs [13:13:43] fun [13:15:45] nah, these are labsdb100* and db1069 [13:15:58] (03CR) 10Faidon Liambotis: [C: 032] Revert "Replicate pt-heartbeat table to labs. Stop replicating msg_resource" [puppet] - 10https://gerrit.wikimedia.org/r/249111 (owner: 10Faidon Liambotis) [13:16:01] jynus: ^^^ [13:16:32] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[13:16:47] 6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1757091 (10MoritzMuehlenhoff) 3NEW [13:16:48] (03PS3) 10Faidon Liambotis: rm files/misc/apt-security-updates [puppet] - 10https://gerrit.wikimedia.org/r/249063 (owner: 10Dzahn) [13:16:55] (03CR) 10Faidon Liambotis: [C: 032 V: 032] rm files/misc/apt-security-updates [puppet] - 10https://gerrit.wikimedia.org/r/249063 (owner: 10Dzahn) [13:18:20] 6operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750#1757105 (10MoritzMuehlenhoff) [13:18:21] 6operations: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747#1757104 (10MoritzMuehlenhoff) [13:19:20] (03CR) 10Faidon Liambotis: "I'm not sure I see the point..." [puppet] - 10https://gerrit.wikimedia.org/r/249017 (owner: 10BBlack) [13:29:43] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, and 2 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1757131 (10BBlack) [13:36:49] (03CR) 10Ottomata: "Aside from one nit, +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249100 (https://phabricator.wikimedia.org/T113252) (owner: 10Joal) [13:40:12] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 3labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1757149 (10chasemp) p:5Triage>3Normal [13:40:24] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1757151 (10Ottomata) Ok, cool, I'm cool with that, so: `request_id` - UUID1 from Varnish, not necessarily unique for an individual event `event_id`... 
[13:40:34] 6operations, 10Traffic: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1757153 (10BBlack) 3NEW [13:40:56] 6operations, 10Traffic: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1757167 (10BBlack) [13:40:57] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, and 2 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1757166 (10BBlack) [13:41:10] 6operations, 10Traffic: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#1757168 (10BBlack) [13:41:11] 6operations, 10Traffic: Split HTCP multicast addresses - https://phabricator.wikimedia.org/T116752#1757153 (10BBlack) [13:42:02] (03PS1) 10BBlack: vhtcpd: refac args template, allow multiple mc addrs [puppet] - 10https://gerrit.wikimedia.org/r/249117 (https://phabricator.wikimedia.org/T116752) [13:42:04] (03PS1) 10BBlack: HTCP: split multicast for cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/249118 (https://phabricator.wikimedia.org/T112836) [13:44:26] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1757178 (10Ottomata) Hm, the R610s look good, although we don't need SSDs. If I had to cho... [13:47:23] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [13:48:15] (03PS1) 10Jcrespo: Revert "Revert "Replicate pt-heartbeat table to labs. Stop replicating msg_resource"" [puppet] - 10https://gerrit.wikimedia.org/r/249120 [13:48:22] (03PS2) 10Jcrespo: Revert "Revert "Replicate pt-heartbeat table to labs. 
Stop replicating msg_resource"" [puppet] - 10https://gerrit.wikimedia.org/r/249120 [13:48:32] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1757179 (10mobrovac) >>! In T116247#1754709, @Ottomata wrote: > What do y'all think about keeping these 'framing' fields in a nested object? I'm not... [13:49:03] (03CR) 10Jcrespo: [C: 032] Revert "Revert "Replicate pt-heartbeat table to labs. Stop replicating msg_resource"" [puppet] - 10https://gerrit.wikimedia.org/r/249120 (owner: 10Jcrespo) [13:49:15] (03PS1) 10BBlack: wgHTCPRouting: use separate address for upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 [13:49:29] (03PS2) 10BBlack: wgHTCPRouting: use separate address for upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) [13:50:24] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[13:51:33] (03PS2) 10BBlack: vhtcpd: refac args template, allow multiple mc addrs [puppet] - 10https://gerrit.wikimedia.org/r/249117 (https://phabricator.wikimedia.org/T116752) [13:51:35] (03PS2) 10BBlack: HTCP: split multicast for cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/249118 (https://phabricator.wikimedia.org/T112836) [13:54:51] (03PS1) 10Alexandros Kosiaris: otrs: Ship systemd unit file for OTRS Daemon [puppet] - 10https://gerrit.wikimedia.org/r/249123 [13:58:56] 7Blocked-on-Operations, 6operations, 7HHVM, 5Patch-For-Review: Reimage mw1152 as a terbium replacement - https://phabricator.wikimedia.org/T116728#1757203 (10hashar) [13:59:43] bblack: regarding splitting HTCP announces ( $$wgHTCPRouting ), I don't think that feature has ever been used / properly tested [13:59:53] bblack: iirc I wrote it following a discussion with mark ages ago [14:00:04] kart_: Respected human, time to deploy ContentTranslation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151027T1400). Please do the needful. [14:00:14] (03PS3) 10BBlack: vhtcpd: refac args template, allow multiple mc addrs [puppet] - 10https://gerrit.wikimedia.org/r/249117 (https://phabricator.wikimedia.org/T116752) [14:00:16] (03PS3) 10BBlack: HTCP: split multicast for cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/249118 (https://phabricator.wikimedia.org/T112836) [14:00:25] hashar: ok :) [14:00:43] hashar: the cluster will listen on both for now when I push the puppet part, we can take time with the rest... [14:02:33] 6operations, 10Traffic: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#1757205 (10hashar) A word of caution, $wgHTCPRouting hasn't been used on the Wikimedia cluster and it might be broken. From operations/mediawiki-config.git: Production does not use it (squid.php): ``... 
[14:02:41] bblack: added a word of caution on the task [14:02:55] yes, sir jouncebot [14:03:01] (03CR) 10JanZerebecki: [C: 04-1] "This adds a cipher that is worse than all the others in the list. I don't get the reason." [puppet] - 10https://gerrit.wikimedia.org/r/249017 (owner: 10BBlack) [14:03:01] that might well save up some bandwidth / processing time [14:03:20] :) [14:04:52] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail [14:05:43] payments2003 alert ^^^ is just a puppetmaster reboot [14:09:05] (03PS1) 10coren: Remove msg_resource table from replication views [software] - 10https://gerrit.wikimedia.org/r/249124 (https://phabricator.wikimedia.org/T116720) [14:09:52] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail [14:09:53] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 37 failures [14:11:15] 6operations, 10Traffic: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1757224 (10faidon) [14:11:29] (03CR) 10BBlack: "In security terms 3DES isn't really worse than AES-CBC (there are tradeoffs, but not major in the grand scheme of things, mostly performan" [puppet] - 10https://gerrit.wikimedia.org/r/249017 (owner: 10BBlack) [14:12:06] (03CR) 10Jcrespo: [C: 031] Remove msg_resource table from replication views [software] - 10https://gerrit.wikimedia.org/r/249124 (https://phabricator.wikimedia.org/T116720) (owner: 10coren) [14:13:40] (03CR) 10EBernhardson: "the ticket is linked in the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/248596 (https://phabricator.wikimedia.org/T109690) (owner: 10EBernhardson) [14:14:52] (03CR) 10coren: [C: 032] Remove msg_resource table from replication views [software] - 10https://gerrit.wikimedia.org/r/249124 (https://phabricator.wikimedia.org/T116720) (owner: 10coren) [14:14:52] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail [14:14:52] RECOVERY - check_puppetrun 
on payments2002 is OK: OK: Puppet is currently enabled, last run 110 seconds ago with 0 failures [14:15:03] (03CR) 10coren: [V: 032] Remove msg_resource table from replication views [software] - 10https://gerrit.wikimedia.org/r/249124 (https://phabricator.wikimedia.org/T116720) (owner: 10coren) [14:18:52] (03PS1) 10coren: maintain-replicas: match changed layout of mediawiki-config [software] - 10https://gerrit.wikimedia.org/r/249127 [14:19:52] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 217 seconds ago with 0 failures [14:20:42] (03PS1) 10BBlack: upload purging: do not listen on text/mobile addr [puppet] - 10https://gerrit.wikimedia.org/r/249128 (https://phabricator.wikimedia.org/T116752) [14:20:44] (03PS1) 10BBlack: purging: do not VCL-filter on domain regex [puppet] - 10https://gerrit.wikimedia.org/r/249129 (https://phabricator.wikimedia.org/T116752) [14:23:28] (03PS3) 10BBlack: wgHTCPRouting: use separate address for upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) [14:24:33] (03PS1) 10Filippo Giunchedi: cassandra: add restbase-test2003-b instance [puppet] - 10https://gerrit.wikimedia.org/r/249130 [14:24:51] (03CR) 10Hashar: [C: 031] role::deployment: remove test role [puppet] - 10https://gerrit.wikimedia.org/r/249094 (owner: 10Giuseppe Lavagetto) [14:25:49] (03PS4) 10BBlack: vhtcpd: refac args template, allow multiple mc addrs [puppet] - 10https://gerrit.wikimedia.org/r/249117 (https://phabricator.wikimedia.org/T116752) [14:25:51] (03PS4) 10BBlack: HTCP: split multicast for cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/249118 (https://phabricator.wikimedia.org/T112836) [14:25:53] (03PS2) 10BBlack: purging: do not VCL-filter on domain regex [puppet] - 10https://gerrit.wikimedia.org/r/249129 (https://phabricator.wikimedia.org/T116752) [14:25:55] (03PS2) 10BBlack: upload purging: do not listen on text/mobile addr [puppet] - 
10https://gerrit.wikimedia.org/r/249128 (https://phabricator.wikimedia.org/T116752) [14:26:33] (03PS2) 10Filippo Giunchedi: cassandra: add restbase-test2003-b instance [puppet] - 10https://gerrit.wikimedia.org/r/249130 [14:26:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase-test2003-b instance [puppet] - 10https://gerrit.wikimedia.org/r/249130 (owner: 10Filippo Giunchedi) [14:28:42] (03PS1) 10Jcrespo: Setting max_allowed_package to 32MB because otrs requires it [puppet] - 10https://gerrit.wikimedia.org/r/249132 [14:32:04] (03PS2) 10Joal: Update camus runs [puppet] - 10https://gerrit.wikimedia.org/r/249100 (https://phabricator.wikimedia.org/T113252) [14:33:43] (03PS5) 10BBlack: vhtcpd: refac args template, allow multiple mc addrs [puppet] - 10https://gerrit.wikimedia.org/r/249117 (https://phabricator.wikimedia.org/T116752) [14:33:58] (03CR) 10BBlack: [C: 032 V: 032] "Compiler-verified no-op" [puppet] - 10https://gerrit.wikimedia.org/r/249117 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [14:36:35] (03PS5) 10BBlack: HTCP: split multicast for cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/249118 (https://phabricator.wikimedia.org/T112836) [14:39:01] (03PS1) 10Jcrespo: Change all mysql servers to max_allowed_package = 32MB [puppet] - 10https://gerrit.wikimedia.org/r/249135 [14:40:52] (03PS1) 10Filippo Giunchedi: cassandra: fix restbase-test2003-b JMX port [puppet] - 10https://gerrit.wikimedia.org/r/249136 [14:41:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: fix restbase-test2003-b JMX port [puppet] - 10https://gerrit.wikimedia.org/r/249136 (owner: 10Filippo Giunchedi) [14:42:38] (03PS2) 10Jcrespo: Setting max_allowed_package to 32MB because otrs requires it [puppet] - 10https://gerrit.wikimedia.org/r/249132 [14:43:45] (03CR) 10Jcrespo: [C: 032] Setting max_allowed_package to 32MB because otrs requires it [puppet] - 10https://gerrit.wikimedia.org/r/249132 (owner: 10Jcrespo) [14:46:59] (03PS6) 10BBlack: 
HTCP: split multicast for cache clusters [puppet] - 10https://gerrit.wikimedia.org/r/249118 (https://phabricator.wikimedia.org/T112836) [14:47:25] (03CR) 10BBlack: [C: 032 V: 032] "Compiler-verified changes look ok, manually tested on cp1071 and functions as expected." [puppet] - 10https://gerrit.wikimedia.org/r/249118 (https://phabricator.wikimedia.org/T112836) (owner: 10BBlack) [14:49:09] can anyone help us get rid of a gerrit replication please ? https://gerrit.wikimedia.org/r/#/c/244498/ :D [14:51:15] (03CR) 10BBlack: [C: 04-1] "Needs investigation first: @hashar said the MW code supporting this is old and probably untested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [14:51:35] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1757410 (10GWicke) > I've been thinking about it too. Ideally, we could leave these fields out of schema defs, simply reference them. But, that seems... [14:53:11] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, and 3 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1757416 (10BBlack) The maps clusters now listen for HTCP on `239.128.0.114`, can you configure your software to emit the HTCP to that address and then... [14:54:54] (03PS2) 10Jcrespo: Change all mysql servers to max_allowed_package = 32MB [puppet] - 10https://gerrit.wikimedia.org/r/249135 [14:55:34] (03CR) 10Hashar: "$wgHTCProuting has been introduced with https://gerrit.wikimedia.org/r/71597" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [14:56:19] (03CR) 10BBlack: "Should we block this on logging/stats for 429 responses first, so that we can see when/how it acts easily?" 
[puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [14:56:34] (03PS2) 10Alexandros Kosiaris: Update WikimediaTemplates to support 5.0.1 [software/otrs] - 10https://gerrit.wikimedia.org/r/248916 [14:56:44] (03PS2) 10Hashar: tox entry point to run pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) [14:57:20] (03CR) 10jenkins-bot: [V: 04-1] tox entry point to run pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [14:57:52] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0] [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151027T1500). Please do the needful. [15:00:04] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:01:52] !log TT112626 Ran fix-stats.php for CX (from bewiki to ruwiki) [15:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:19] James_F|Away: I can SWAT this morning, if you're around. [15:09:56] I don't think he is [15:13:01] (03CR) 10Alexandros Kosiaris: [C: 031] cassandra: switch to monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/249082 (owner: 10Filippo Giunchedi) [15:13:57] 6operations, 10netops, 10procurement: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1757550 (10RobH) I never got a reply back, so I'll email Arul about this today. [15:15:15] !log reenable puppet on graphite1001 [15:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:00] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1757569 (10Rjd0060) >>! 
In T74109#1756437, @MartinK wrote: > Imho for volunteers like us the key benefit of OTRS Version 5 is the mobile ready user interface. Being able to Prozess some Tickets while co... [15:19:32] (03PS2) 10Alexandros Kosiaris: otrs: Ship systemd unit file for OTRS Daemon [puppet] - 10https://gerrit.wikimedia.org/r/249123 [15:25:03] PROBLEM - configured eth on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:03] PROBLEM - puppet last run on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:03] PROBLEM - RAID on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:03] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:13] PROBLEM - Disk space on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:23] PROBLEM - salt-minion processes on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:33] PROBLEM - dhclient process on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:33] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: Ship systemd unit file for OTRS Daemon [puppet] - 10https://gerrit.wikimedia.org/r/249123 (owner: 10Alexandros Kosiaris) [15:25:42] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1757593 (10BBlack) @cmjohnson - let's try to coordinate on this sometime today and test? [15:25:43] PROBLEM - Check size of conntrack table on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:46] (03PS3) 10Alexandros Kosiaris: otrs: Ship systemd unit file for OTRS Daemon [puppet] - 10https://gerrit.wikimedia.org/r/249123 [15:25:54] PROBLEM - Disk space on Hadoop worker on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:26:02] PROBLEM - Hadoop DataNode on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:26:09] (03CR) 10Alexandros Kosiaris: [V: 032] "http://puppet-compiler.wmflabs.org/1090/ says a noop for iodine, the expected outcome for mendelevium, merging" [puppet] - 10https://gerrit.wikimedia.org/r/249123 (owner: 10Alexandros Kosiaris) [15:26:13] PROBLEM - SSH on analytics1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:14] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:26:42] PROBLEM - DPKG on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:52] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 1, unused: 0 [15:31:58] 6operations, 10ops-codfw, 10netops: attach mr1-ulsfo to new out of band mgmt link - https://phabricator.wikimedia.org/T116766#1757622 (10RobH) 3NEW a:3Papaul [15:32:41] 6operations, 10ops-codfw, 10netops: attach mr1-codfw to new out of band mgmt link - https://phabricator.wikimedia.org/T116766#1757622 (10RobH) [15:33:02] (03PS1) 10Alexandros Kosiaris: otrs: omit the --force argument to otrs.Daemon.pl [puppet] - 10https://gerrit.wikimedia.org/r/249144 [15:33:50] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1757636 (10Milimetric) To add a little bit to the description, the data collected here is sensitive fr... [15:35:04] (03CR) 10BBlack: "I peeked at the code in https://github.com/wikimedia/mediawiki/blob/4ca4ae9009e8668afbb3b2c3f9701371d7958700/includes/deferred/SquidUpdate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [15:35:39] thcipriani: Argh, hey. I'm here but too many false positives meant I didn't see your ping. 
[15:35:57] 6operations, 7Monitoring: limit the impact of heavy/large graphite queries - https://phabricator.wikimedia.org/T116767#1757637 (10fgiunchedi) 3NEW [15:36:12] James_F: np, ready to SWAT some things? [15:36:17] Sure. [15:37:27] Projet, eh? I'm assuming since that's in 3 places it's legit. [15:37:49] thcipriani: Yup. French. [15:38:10] that adds up. [15:38:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248910 (https://phabricator.wikimedia.org/T116603) (owner: 10Jforrester) [15:38:47] (03PS1) 10Andrew Bogott: Add monitoring for the kvm ssl cert, labvirt-star [puppet] - 10https://gerrit.wikimedia.org/r/249147 (https://phabricator.wikimedia.org/T116332) [15:38:49] (03Merged) 10jenkins-bot: Enable VisualEditor in the 'Projet' namespace on the French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248910 (https://phabricator.wikimedia.org/T116603) (owner: 10Jforrester) [15:39:36] (03CR) 10Ottomata: "Looking good." [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [15:40:12] godog: woudl appreciate review when you find time: https://gerrit.wikimedia.org/r/#/c/248067/ [15:41:26] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor in the "Projet" namespace on the French Wikipedia [[gerrit:248910]] (duration: 00m 17s) [15:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:33] ^ James_F check please [15:42:04] (03Abandoned) 10Ottomata: Initial debian packaging [debs/golang-burrow] (debian) - 10https://gerrit.wikimedia.org/r/248245 (https://phabricator.wikimedia.org/T116084) (owner: 10Ottomata) [15:42:05] 6operations, 10ops-codfw, 10netops: attach mr1-codfw to new out of band mgmt link - https://phabricator.wikimedia.org/T116766#1757660 (10Papaul) a:5Papaul>3RobH cable ID 1099 [15:42:18] (03PS2) 10Ottomata: Initial debian packaging [debs/burrow] (debian) 
- 10https://gerrit.wikimedia.org/r/248342 (https://phabricator.wikimedia.org/T116084) [15:42:34] (03Abandoned) 10Ottomata: Initial debian packaging [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/248342 (https://phabricator.wikimedia.org/T116084) (owner: 10Ottomata) [15:42:51] Hmm. [15:42:54] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: omit the --force argument to otrs.Daemon.pl [puppet] - 10https://gerrit.wikimedia.org/r/249144 (owner: 10Alexandros Kosiaris) [15:43:05] papaul: why are all the other patch #s 5 digit and that one 4? [15:43:09] just a bit odd [15:43:17] (just making sure you arent reusing old numbers right?) [15:43:28] thcipriani: It's not broken anything, but it doesn't appear to be working. [15:43:29] This is a 10 ft old cable that i did use [15:43:36] ahh, old copper duh on my part [15:43:39] Robh: if you want i can change the laber [15:43:40] good enough, thank you! [15:43:47] nah it just didnt make sense, and now it does [15:43:51] James_F: lemme double-check that everything is right. [15:43:55] robh:ok [15:44:30] 6operations, 10ops-codfw, 10netops: attach mr1-codfw to new out of band mgmt link - https://phabricator.wikimedia.org/T116766#1757668 (10RobH) 5Open>3Resolved updated gsheet xconnect tracking with the cable #, resolving task. [15:44:42] thcipriani: Aha, it's working. [15:44:47] thcipriani: Must have been cache issue. [15:45:00] 6operations, 10netops, 5Patch-For-Review: setup new equinix out of band mgmt access - https://phabricator.wikimedia.org/T113771#1757673 (10Cmjohnson) Patch ID on mr1 updated fe-0/0/5 up up Transit: James_F: kk, thanks for checking. Next patch! [15:45:18] (03CR) 10Filippo Giunchedi: [C: 031] The varnish reqstats diamond collector does not work, emit to statsd directly instead [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [15:45:32] ottomata: yup, LGTM! [15:46:38] danke! [15:46:46] mobrovac: any joy with mobileapps btw? 
looks like it still takes >2s or so [15:46:49] ottomata: got a sec re varnishreqstats code? [15:46:58] (03PS2) 10Ottomata: Initial debian packaging [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/248344 (https://phabricator.wikimedia.org/T116084) [15:47:16] ottomata: basically I'm wondering about: elif tag in ['RxStatus', 'TxStatus'] and is_valid_http_status(value): [15:47:44] it should just be one or the other, right? or is this relying on the vcl thing being in "client" vs "backend" mode and one run of the script owuld only ever get one of the two [15:47:54] (03PS3) 10ArielGlenn: dumps: admin script to do cleanup, enter maintenance mode, etc [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/234971 [15:48:42] otherwise if you're getting both, wouldn't you double-log e.g. 404 as 1x RxStatus from varnish-be to varnish-fe + 1x TxStatus from varnish-fe to client? (and then sometimes they would differ, if varnish re-interprets the status) [15:49:11] I guess ditto for request method TxRequest and RxRequest [15:49:31] 6operations, 10ops-eqiad: db1030 RAID degraded (disk failed) - https://phabricator.wikimedia.org/T116499#1757716 (10Cmjohnson) 5Open>3Resolved cmjohnson@db1030:~$ sudo megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun... [15:49:51] ottomata: nevermind! I just hadn't read the code closely enough, I see the backend/client split now [15:50:00] * James_F waits 'patiently' for Jenkins. [15:52:01] (03PS1) 10Cmjohnson: Removing mgmt entries for decom'd server sodium [dns] - 10https://gerrit.wikimedia.org/r/249148 [15:53:04] (03CR) 10BryanDavis: "godog wrote:" [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [15:53:12] (03CR) 10Cmjohnson: [C: 032] Removing mgmt entries for decom'd server sodium [dns] - 10https://gerrit.wikimedia.org/r/249148 (owner: 10Cmjohnson) [15:53:26] bblack...uhhh, ok! 
:) [15:53:35] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1757745 (10Jgreen) > @Jgreen, most of that looks right. I didn't know about the landingpages.tsv.log bit--is that... [15:54:09] 6operations, 10RESTBase-Cassandra, 7Monitoring: service_checker - https://phabricator.wikimedia.org/T116770#1757748 (10fgiunchedi) 3NEW [15:54:36] 6operations, 10RESTBase-Cassandra, 7Monitoring: service_checker reports success even on endpoints timing out - https://phabricator.wikimedia.org/T116770#1757757 (10fgiunchedi) [15:55:11] (03PS4) 10BBlack: The varnish reqstats diamond collector does not work, emit to statsd directly instead [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [15:56:40] PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 40 failures [15:56:43] COOL lets' merge that, will stop screened process on 1057 [15:56:50] (03CR) 10Ottomata: [C: 032] The varnish reqstats diamond collector does not work, emit to statsd directly instead [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [15:59:00] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: puppet fail [15:59:05] ottomata: I still think there's a dependency issue, but I haven't quite sorted it out [15:59:20] PROBLEM - NTP on analytics1039 is CRITICAL: NTP CRITICAL: No response from NTP server [15:59:31] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1757785 (10Jgreen) We need to get this cut over ASAP, as it is blocking Tech Ops in several important ways. 
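[editor's note] bblack's double-counting question about varnishreqstats can be sketched in a few lines. This is not the script under review. The tag-to-side mapping, the `count_statuses` helper, and this version of `is_valid_http_status` are assumptions for illustration. The point is only that each run reads one side (client or backend), so a single response is counted once.

```python
from collections import Counter

# Assumed mapping: which status tag belongs to which side of a transaction.
# Client side: varnish transmits the status to the client (TxStatus).
# Backend side: varnish receives the status from the backend (RxStatus).
STATUS_TAG = {"client": "TxStatus", "backend": "RxStatus"}


def is_valid_http_status(value):
    """Return True if value looks like an HTTP status code (100-599)."""
    return value.isdecimal() and 100 <= int(value) <= 599


def count_statuses(records, mode):
    """Count statuses for one side only, so a response is never counted twice."""
    wanted = STATUS_TAG[mode]
    counts = Counter()
    for tag, value in records:
        if tag == wanted and is_valid_http_status(value):
            counts[value] += 1
    return counts


# One cache miss seen from both sides (404 in, 404 out), plus one cache hit.
records = [("RxStatus", "404"), ("TxStatus", "404"), ("TxStatus", "200")]
client_counts = count_statuses(records, "client")
backend_counts = count_statuses(records, "backend")
```

Accepting both tags in a single run would count the 404 above twice, which is the case bblack was asking about before spotting the backend/client split.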
[15:59:58] ottomata: basically, I think your base::service_unit for reqstats needs service_params => { require => Service[varnish-frontend] } or whatever the instance name is, and the systemd unit file needs the After= line also copied to a Require= line [16:00:04] _joe_ andrewbogott: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151027T1600). [16:00:04] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:18] omg, will I never not be on puppet swat duty? [16:00:28] otherwise reqstats will either fail to start because varnishd isn't up yet, and/or it will jump over puppet's dependency ordering and indirectly start varnish early during initial puppetization, one way or anothewr [16:00:34] (03PS1) 10Ottomata: Fix param name for reqstats [puppet] - 10https://gerrit.wikimedia.org/r/249150 [16:00:51] oh bblack, ok [16:01:01] copied that from another, but I am sure you are right [16:01:02] thcipriani: I'm here to test if you want to scap. [16:01:16] James_F: yup, just lining up the sync [16:01:20] :-) [16:01:22] Thanks. [16:01:40] PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 40 failures [16:01:43] hm, bblack, can I put the require on the define instead of in it? [16:01:44] like [16:01:54] varnish::logging::reqstats { 'frontend': [16:01:54] require => Varnish::Instance['misc'], [16:01:58] (or frontend, orwhatever) [16:02:06] that way the whole define won't happen unless that does? 
[16:02:09] Krenair: despite what the bot says, jynus and Coren are on puppet swat duty today [16:02:09] !log thcipriani@tin Synchronized php-1.27.0-wmf.3/resources/src/mediawiki/mediawiki.ForeignStructuredUpload.BookletLayout.js: SWAT: mw.ForeignStructuredUpload: Mark description as being in source wikis content language [[gerrit:249081]] (duration: 00m 17s) [16:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:14] that way I don't have ot pass the service_param required into the define as another parm [16:02:15] ^ James_F check please [16:02:20] thcipriani: Checking. [16:02:26] andrewbogott, umm... okay [16:02:27] The bot, she lies? [16:02:27] I can only hope that someone else will repair the bot’s misconceptions [16:02:36] I have no idea how. [16:02:59] ottomata: sure whichever way works. but since reqstats won't work without the instance, it seems like the require should always be there inside of it. We already have params that know the varnish::instance instance name right? [16:03:02] it ought to come straight from https://wikitech.wikimedia.org/wiki/Deployments which is… correct [16:03:04] the bot has lag [16:03:09] andrewbogott: I got a bead on the labvirt1010/1011 situation [16:03:20] chasemp: yeah? Do tell [16:03:21] https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=197512&oldid=197387 [16:03:29] hm, yeah, instance name should be enough. 
hm [16:03:31] chasemp: (I’m in a meeting so only 20% present here) [16:04:20] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [16:04:23] ottomata: well I guess you can construct it the same way as the local $service_unit_name [16:04:45] also: Oct 27 15:59:31 cp1056 puppet-agent[6133]: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter metric_path at /etc/puppet/modules/role/manifests/cache/misc.pp:219 on node cp1056.eqiad.wmnet [16:04:52] yeah [16:04:57] am fixing [16:05:02] RECOVERY - configured eth on lvs1008 is OK: OK - interfaces up [16:05:02] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [16:05:21] bblack, like this? [16:05:21] service_params => { [16:05:21] require => Service["varnish-${instance_name}"] [16:05:47] robh: good morning :-} Seems scandium can be moved to labs-support and installed now ! chase assigned the task to you https://phabricator.wikimedia.org/T95046 [16:05:55] and [16:05:59] Krenair: All the patches are yours? :-) [16:06:00] Require=varnish<%= /\w/.match(@instance_name) ? "-#{@instance_name}" : '' -%>.service [16:06:00] (03PS4) 10ArielGlenn: dumps: admin script to do cleanup, enter maintenance mode, etc [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/234971 [16:06:01] ? [16:06:20] yep [16:06:21] robh: will probably fill another task to remove the SSDs from labnodepool1001.eqiad.wmnet since we have no more any use for them . But that is a different story [16:06:40] RECOVERY - check_puppetrun on beryllium is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures [16:06:40] thcipriani: Yup, it works great. [16:06:50] James_F: awesome! Thanks for checking. [16:07:07] thcipriani: Of course. Thank you for deploying! 
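[editor's note] The naming rule in the ERB snippet above (plain `varnish` for the unnamed instance, `varnish-<name>` otherwise) and the two dependency lines bblack asked for can be sketched in Python. The function names are hypothetical and this is not the puppet template itself. Note that systemd spells the requirement directive `Requires=`.

```python
def varnish_unit_name(instance_name=None):
    """Return 'varnish' for the default instance, 'varnish-<name>' for a named one."""
    if instance_name:
        return f"varnish-{instance_name}"
    return "varnish"


def unit_dependency_lines(instance_name=None):
    """Render the [Unit] lines a reqstats-style unit would need."""
    service = f"{varnish_unit_name(instance_name)}.service"
    # After= only orders startup; Requires= also pulls the varnish unit in
    # and stops this unit if varnish is stopped.
    return [f"After={service}", f"Requires={service}"]


print("\n".join(unit_dependency_lines("frontend")))
print("\n".join(unit_dependency_lines()))
```

Keeping both lines matters: `Requires=` without `After=` would let the two units start in parallel, which is the "fail to start because varnishd isn't up yet" case.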
[16:07:27] oh hmm, althouhg, if no -frontend, hm [16:07:37] won't be $isntance name is undef [16:07:37] hm [16:07:42] hence $instancesuffix [16:07:43] ah [16:07:45] while I get puppet swat, I don't get the exactly what falls in the params of it... hm [16:07:46] hm [16:07:48] !log round of fundraising OS updates, occasional icinga noise is expected [16:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:12] ottomata: yeah, so it's a lot like your existing $service_unit_name [16:08:47] except the default case is "varnish", and if the instance name is defined it's "varnish-${instance_name}" [16:09:24] (03CR) 10Jcrespo: [C: 031] Rename mediawiki::web::sites to mediawiki::web::prod_sites to make room for a new generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/244228 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:09:27] Krenair: I'd have really liked it if 248634 had some blessing from _joe_. It seems okay to me in principle, but I'm not sure if that has unexpected side effects. [16:10:05] Yes, I listed these under the understanding that _joe_ and andrewbogott were doing this [16:10:12] Ah! [16:10:21] you can leave it [16:10:27] Some I can okay though. [16:10:27] godog: not yet, sorry, got preempted with other stuff, will look at it asap [16:10:34] Coren: on the off chance... would something like https://gerrit.wikimedia.org/r/#/c/244806/ be puppet swat-able? [16:10:57] JohnFLewis: Yes. [16:11:13] awesome. let me see if it rebases... [16:11:18] (03PS2) 10John F. 
Lewis: monitoring: append sms to contact groups, don't override with admins,sms [puppet] - 10https://gerrit.wikimedia.org/r/244806 [16:11:27] (03CR) 10BBlack: [C: 032] Fix param name for reqstats [puppet] - 10https://gerrit.wikimedia.org/r/249150 (owner: 10Ottomata) [16:11:43] JohnFLewis: plz to add to https://wikitech.wikimedia.org/wiki/Deployments [16:11:58] Coren: have the page in edit right now :) [16:12:27] (03PS1) 10Filippo Giunchedi: cassandra: add eqiad test cluster multiple instances [dns] - 10https://gerrit.wikimedia.org/r/249152 (https://phabricator.wikimedia.org/T95253) [16:12:42] hashar: ok, its behind a few other allocations on my list today but now is on radar [16:12:45] hashar: jessie? [16:12:47] or trusty? [16:12:49] Jessie [16:12:59] (03PS4) 10coren: dynamicproxy: Make blocked user agents configurable [puppet] - 10https://gerrit.wikimedia.org/r/246125 (https://phabricator.wikimedia.org/T90844) (owner: 10Alex Monk) [16:13:15] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1757852 (10hashar) Should use the Jessie operating system. [16:13:19] cool [16:13:26] (03PS1) 10Ottomata: Fix varnishreqstats dependency on varnish service [puppet] - 10https://gerrit.wikimedia.org/r/249153 [16:13:27] Coren: added! [16:13:30] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1757855 (10RobH) [16:13:41] bblack ^^^ [16:13:57] (03CR) 10coren: [C: 032] "Straightforward." 
[puppet] - 10https://gerrit.wikimedia.org/r/246125 (https://phabricator.wikimedia.org/T90844) (owner: 10Alex Monk) [16:15:02] let's coordinate a bit here [16:15:11] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:15:37] (03CR) 10Joal: [C: 04-1] "Parameter is not correct milimetric :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) (owner: 10Milimetric) [16:15:44] (03CR) 10BBlack: [C: 031] Fix varnishreqstats dependency on varnish service [puppet] - 10https://gerrit.wikimedia.org/r/249153 (owner: 10Ottomata) [16:15:54] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 3labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1757874 (10chasemp) So we have had labvirt1010 installing for awhile now, but labvirt1011 has been a source of difficulty. Same box type, same ilo, same... [16:15:55] jynus: Sorry, I thought you were otherwise occupied. [16:16:02] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure, 3labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1757875 (10chasemp) [16:16:21] no, I was reviewing the changes, I am not familiar with all the classes [16:16:35] 6operations, 10netops, 5Patch-For-Review: setup new equinix out of band mgmt access - https://phabricator.wikimedia.org/T113771#1757879 (10RobH) 5Open>3Resolved Sheet updated [16:16:54] I went for the labs-related ones first as this is the set I am most familiar with. :-) [16:17:00] ok :-) [16:17:31] basically, let's merge just once [16:17:45] instead of 7 times [16:18:07] ...? I'd rather do a merge-per-patch because if something unexpectedly breaks we know exactly what? [16:18:34] jynus: If you feel comfortable with an omnibus merge, it's okay with me. 
[16:18:54] ok, then [16:19:01] but there is a dependency [16:19:05] let's do it in order [16:19:37] (03PS4) 10Jcrespo: Rename mediawiki::web::sites to mediawiki::web::prod_sites to make room for a new generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/244228 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:20:00] RECOVERY - configured eth on lvs1007 is OK: OK - interfaces up [16:20:26] (03CR) 10Jcrespo: [C: 032] Rename mediawiki::web::sites to mediawiki::web::prod_sites to make room for a new generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/244228 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:20:40] (03PS2) 10Filippo Giunchedi: cassandra: add eqiad test cluster multiple instances [dns] - 10https://gerrit.wikimedia.org/r/249152 (https://phabricator.wikimedia.org/T95253) [16:20:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add eqiad test cluster multiple instances [dns] - 10https://gerrit.wikimedia.org/r/249152 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [16:21:25] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1757895 (10akosiaris) I 've upgraded the test installation today to OTRS version 5.0.1. There is one thing that has not been upgraded to version 5 and that is the QuickClose functionality that is provid... 
[16:21:33] (03PS3) 10Giuseppe Lavagetto: wikitech: make private settings file writable by owner [puppet] - 10https://gerrit.wikimedia.org/r/249076 (https://phabricator.wikimedia.org/T87036) [16:22:23] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/249076 (https://phabricator.wikimedia.org/T87036) (owner: 10Giuseppe Lavagetto) [16:24:03] (03PS4) 10Rush: phabricator: Set security ext tag for labs [puppet] - 10https://gerrit.wikimedia.org/r/248646 (https://phabricator.wikimedia.org/T104904) (owner: 10Negative24) [16:24:40] (03CR) 10Rush: [C: 032 V: 032] phabricator: Set security ext tag for labs [puppet] - 10https://gerrit.wikimedia.org/r/248646 (https://phabricator.wikimedia.org/T104904) (owner: 10Negative24) [16:24:44] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1757899 (10Joe) The ldap settings file has now permissions that should allow scap not to choke on it. How... [16:24:52] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1757901 (10BBlack) So we've got lvs1007 hooked up now, just that 1/4 of these interfaces cabled with an SFP and enabled. I guess we wait a bit and see if there's any SNMP fallout before trying more port... [16:26:00] (03CR) 10Rush: [C: 031] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/249060 (owner: 10Dzahn) [16:26:01] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: puppet fail [16:26:11] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: puppet fail [16:26:33] is that me^ [16:26:50] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/7: down - Transit: ! 
CyrusOne OOB (IP-000008-01) {#1099} [1Gbps Cu]BR [16:26:54] _joe_: thx for the g+w on wikitech privatesettings. [16:26:55] * Coren checks. [16:27:07] (03CR) 10BBlack: "Well, responding to myself: @ottomata 's in the process of hooking up varnish -> statsd -> graphite for all responses and such now, so as " [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [16:27:32] yes it is [16:28:26] but I think it is a glitch? [16:28:58] it is succeding now [16:29:02] 7Blocked-on-Operations, 6operations, 10Beta-Cluster-Infrastructure, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1757919 (10bd808) >>! In T87036#1757899, @Joe wrote: > The ldap settings file has now permissions that sho... [16:29:37] (03PS2) 10Ottomata: Fix varnishreqstats dependency on varnish service [puppet] - 10https://gerrit.wikimedia.org/r/249153 [16:29:44] (03CR) 10Ottomata: [C: 032] Fix varnishreqstats dependency on varnish service [puppet] - 10https://gerrit.wikimedia.org/r/249153 (owner: 10Ottomata) [16:29:46] I thought puppet was transactionl [16:29:51] (03CR) 10Ottomata: [V: 032] Fix varnishreqstats dependency on varnish service [puppet] - 10https://gerrit.wikimedia.org/r/249153 (owner: 10Ottomata) [16:29:52] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:30:11] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:30:14] jynus: eventually consistent! [16:30:20] cache related? [16:30:41] <_joe_> ostriches: do we want to finish up this damn migration or not? :P [16:30:58] Indeed! Can I get a poke on the 2 puppet changes we need then? 
[16:31:12] (03PS5) 10ArielGlenn: dumps: admin script to do cleanup, enter maintenance mode, etc [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/234971 [16:31:17] (03PS3) 10Ottomata: Update camus runs [puppet] - 10https://gerrit.wikimedia.org/r/249100 (https://phabricator.wikimedia.org/T113252) (owner: 10Joal) [16:31:28] <_joe_> ostriches: which ones? [16:31:34] https://gerrit.wikimedia.org/r/#/c/224829/ and its parent [16:31:34] <_joe_> I might have missed those [16:31:40] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:47] coren, I am not sure about the next one: gerrit:244237 [16:33:02] in particular, the priority on several classes [16:33:32] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-0/0/2: down - Transit: ! NTT (service ID 234631) {#1061} [10Gbps]BR [16:33:41] <_joe_> ostriches: we were also talking about generating the dsh list from etcd data sooner than later [16:33:42] jynus: I don't like it for swat. It touches too many things imo and needs deeper review. [16:33:57] _joe_: Yeah, we could possibly go that route too, the child is less important. [16:34:02] the change is trivial [16:34:09] I just would like to test it [16:34:15] (03PS1) 10Ottomata: Set instance name specifically for varnish misc [puppet] - 10https://gerrit.wikimedia.org/r/249157 [16:34:35] (03CR) 10Ottomata: [C: 032 V: 032] Set instance name specifically for varnish misc [puppet] - 10https://gerrit.wikimedia.org/r/249157 (owner: 10Ottomata) [16:34:37] jynus: That requires commit the change to the puppet compiler though afaict. [16:35:03] Ah, or maybe not - since we did merge the parent patch. [16:35:18] jynus: puppet isn't really transactional in that sense. if you deploy changes in 2 files that depend on each other (e.g. 
require a new class in one file, and create the classfile for it in the same commit), sometimes strontium + palladium view will be out of sync for a few clients as a race condition, and some will fail not finding the new class [16:35:21] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 1, unused: 0 [16:35:31] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [16:35:42] bblack, no need to tell me, I just suffered it [16:35:50] (03PS2) 10Giuseppe Lavagetto: role::deployment: move to role module [puppet] - 10https://gerrit.wikimedia.org/r/249090 [16:35:55] :-) [16:36:26] <_joe_> also called the "puppet is shit" design pattern [16:36:32] he he [16:36:32] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1757948 (10fgiunchedi) I believe package_builder will try importing from jessie-wikimedia if WIKIMEDIA=yes is passed o... [16:37:05] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1757949 (10RobH) Ok so Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32GB Memory, Dual 300GB SSD, Dual 500GB Nearline SAS also include H310. Since we'll have to swap a machine out to one o... [16:37:05] ottomata: your previous patch is generally ok, the problem is the "misc" cluster doesn't have a "frontend" instance. 
its default/unnamed instance is the "frontend" for stats purposes (something I'm trying to address, but patches not ready yet) [16:37:34] which may explain a change I that failed to me recently, and I reverted it probably too soon, but made no sense [16:38:14] 6operations, 10ops-eqiad: label server lawrencium / wmf3542 & swap H310 for H710 controller - https://phabricator.wikimedia.org/T116776#1757955 (10RobH) 3NEW a:3Cmjohnson [16:41:57] (03PS3) 10Filippo Giunchedi: cassandra: switch to monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/249082 [16:42:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: switch to monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/249082 (owner: 10Filippo Giunchedi) [16:42:08] (03PS2) 10Jcrespo: dynamicproxy: Empty data from initial-data.db [puppet] - 10https://gerrit.wikimedia.org/r/248622 (owner: 10Alex Monk) [16:43:04] I do not like the "having blobs" on puppet, but I suppose it now better than before [16:43:27] (03CR) 10Jcrespo: [C: 032] dynamicproxy: Empty data from initial-data.db [puppet] - 10https://gerrit.wikimedia.org/r/248622 (owner: 10Alex Monk) [16:44:52] also, would people identify that as an sqlite file? [16:45:13] ottomata: actually something else is crazy with the $instance_name thing, but I don't know what yet. Either puppet is considering undef==true, or defaulting a classparam to $name makes it impossible to override it with undef as a param? either way, what ends up happening is ::reqstats thinks instance_name is the same as $name "frontend", rather than undef [16:46:12] my guess would be for: class foo($bar=$name) .... 
foo { "asdf": bar => undef } results in the undef defaulting back to $name [16:47:25] <_joe_> bblack: "Either puppet is considering undef==true" is definitely not possible [16:47:43] <_joe_> but you might have done things like testing for undef when it's actually nil [16:48:00] no, it's explicitly undef, and it's just a boolean "if $bar" check [16:48:08] I think it's what I said above with "my guess" [16:48:37] you can't override a defaulted-to-$name classparam with "undef", as explicit undef triggers the defaulting [16:48:52] (03PS1) 10coren: Revert "dynamicproxy: Make blocked user agents configurable" [puppet] - 10https://gerrit.wikimedia.org/r/249159 [16:49:59] (03PS2) 10coren: Revert "dynamicproxy: Make blocked user agents configurable" [puppet] - 10https://gerrit.wikimedia.org/r/249159 [16:50:25] jynus: Reverthing 249159 which behaves oddly. [16:50:55] (03CR) 10coren: [C: 032] "Revert." [puppet] - 10https://gerrit.wikimedia.org/r/249159 (owner: 10coren) [16:51:04] (03PS1) 10RobH: Setting dns entries for scandium [dns] - 10https://gerrit.wikimedia.org/r/249160 [16:51:31] Coren, are you sure it is yours and not mine? [16:51:43] (03CR) 10RobH: [C: 032] Setting dns entries for scandium [dns] - 10https://gerrit.wikimedia.org/r/249160 (owner: 10RobH) [16:51:45] jynus: Yes, I checked in the actual output config. [16:51:48] it shouldn't, but I want to be sure [16:51:50] ok [16:52:31] jynus: It applies cleanly, but one of the parameters doesn't seem to be substituted properly in a template for unclear reasons. It'll need further investigation. [16:52:40] the self certificate thing, I will not aprove without bblack's +1, I will block on that [16:52:51] what self certificate thing? 
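[editor's note] bblack's guess about `class foo($bar=$name)` is that an explicit `undef` cannot be told apart from "parameter not passed", so the default kicks in. The log does not confirm the guess. The Python analogy below (not Puppet, names hypothetical) shows the same failure mode with `None`, and the usual sentinel workaround.

```python
_UNSET = object()  # sentinel meaning "caller did not pass the argument"


def reqstats_naive(name, instance_name=None):
    """None doubles as 'not passed', so an explicit None falls back to name."""
    if instance_name is None:
        instance_name = name
    return instance_name


def reqstats_sentinel(name, instance_name=_UNSET):
    """Only an omitted argument falls back to name; an explicit None survives."""
    if instance_name is _UNSET:
        instance_name = name
    return instance_name


naive = reqstats_naive("frontend", instance_name=None)
fixed = reqstats_sentinel("frontend", instance_name=None)
print(naive, fixed)  # frontend None
```

In the naive version the caller asked for "no instance name" and still got `frontend`, which matches what `::reqstats` was seen doing on cp1056.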
[16:52:58] I do not see consensus on the ticket either [16:53:16] bblack, https://gerrit.wikimedia.org/r/#/c/247587/ [16:53:32] (03PS1) 10BBlack: varnishreqstats: fix instance_name corner case stuff [puppet] - 10https://gerrit.wikimedia.org/r/249161 [16:53:49] (03PS1) 10RobH: setting lawrencium dns entries [puppet] - 10https://gerrit.wikimedia.org/r/249162 [16:53:59] (03PS2) 10RobH: setting lawrencium dns entries [puppet] - 10https://gerrit.wikimedia.org/r/249162 [16:54:16] jynus: if that's in puppetswat, let's hold it off. I haven't had time to dig deep on that, but it's not trivial what's going on there, or necessarily correct. [16:54:27] +1 on your comment, bblack [16:55:24] Krenair, I think you are too ambitious for puppetswat [16:55:26] (03CR) 10RobH: [C: 032] setting lawrencium dns entries [puppet] - 10https://gerrit.wikimedia.org/r/249162 (owner: 10RobH) [16:55:28] (03CR) 10BBlack: "@ottomata feel free to merge this if it looks ok to you, I think it will fix cp1056 misc-cluster testcase, which is failing puppet right n" [puppet] - 10https://gerrit.wikimedia.org/r/249161 (owner: 10BBlack) [16:55:28] :-) [16:55:41] (03CR) 10JanZerebecki: [C: 031] "I confused something. Note https://wiki.mozilla.org/Security/Server_Side_TLS has 3des enabled for intermediate." [puppet] - 10https://gerrit.wikimedia.org/r/249017 (owner: 10BBlack) [16:55:53] heh [16:57:45] I think we are all having issues with parameters [16:58:43] jynus: I have no issues with 244806 [16:58:50] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: Connection timed out [16:59:05] Also ^^ etherpad is esplodey again. [16:59:21] * DanielK_WMDE_ is sad about poor etherpad [17:00:24] Oh, sorry, I was confused [17:00:39] let me add it to the wiki [17:01:25] Coren, technically, I am ok, but is it something we want? 
[17:01:29] (03PS3) 10Milimetric: Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) [17:02:24] jynus: My understanding is that this is what we agreed at the offsite. "If you own it, you get paged for it" [17:02:28] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-0/0/2: down - Transit: ! NTT (service ID 234631) {#1061} [10Gbps]BR [17:02:56] Coren, of course I am ok with it, I am more worried about cost, etc. [17:02:58] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 1.002 second response time [17:03:05] (03PS4) 10Milimetric: Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) [17:03:17] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1758166 (10RobH) [17:03:20] didn rob say that we run recently out of credits? [17:03:44] (03PS1) 10RobH: lawrencium uses h710 controller hw raid [puppet] - 10https://gerrit.wikimedia.org/r/249165 [17:03:45] i fixed that in the meeting [17:03:48] (although I should be partially responsable for that) [17:03:50] jynus: Yes, but that's because we had very little in bank w/ the new provider and they since refilled. [17:03:52] :-( [17:03:53] jynus: we ran out of credit and then renewed them immediately [17:03:59] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 1, unused: 0 [17:04:01] what Coren said =] [17:04:11] ... [17:04:16] hrmm [17:04:17] so I suppose no issue with the change, robh ? [17:04:24] etherpad seems happy again [17:04:24] ? 
[17:04:36] jynus: sorry, I just know we had credits, I dunno what change you mean [17:04:59] (03PS5) 10Milimetric: Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) [17:05:02] ottomata: could you please merge https://gerrit.wikimedia.org/r/247458 [17:05:14] (03CR) 10Andrew Bogott: [C: 032] "the puppet compiler likes this just fine." [puppet] - 10https://gerrit.wikimedia.org/r/248882 (owner: 10Rush) [17:05:28] (03PS2) 10Andrew Bogott: openstack: cleanup up old repo setups [puppet] - 10https://gerrit.wikimedia.org/r/248882 (owner: 10Rush) [17:05:46] (03CR) 10Krinkle: [C: 031] Checkout instead of cherry-pick [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [17:07:42] (03CR) 10Krinkle: "This means it may not incorporate latest change from the target branch, however." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [17:07:57] Coren, +1 to that [17:08:07] okay, will merge [17:08:19] (03PS3) 10coren: monitoring: append sms to contact groups, don't override with admins,sms [puppet] - 10https://gerrit.wikimedia.org/r/244806 (owner: 10John F. Lewis) [17:08:38] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:08:39] hm? I see a lot of paging discussion then my patch, what have I missed? [17:09:40] (03PS2) 10RobH: lawrencium uses h710 controller hw raid [puppet] - 10https://gerrit.wikimedia.org/r/249165 [17:09:46] (03CR) 10RobH: [C: 032] lawrencium uses h710 controller hw raid [puppet] - 10https://gerrit.wikimedia.org/r/249165 (owner: 10RobH) [17:10:23] Is anyone looking into the etherpad issue? 
It's hiding our meeting notes :-( [17:10:31] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Scaling: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1758179 (10hashar) `WIKIMEDIA=yes` does not include either `jessie-backports` or `jessie-wikimedia/thirdparty`. We wo... [17:10:58] andrewbogott: i just merged your puppet chagne [17:11:00] cuz i had one also [17:11:07] robh: thanks, sorry [17:11:09] (just so you dont think you are insane when its not there ;) [17:11:21] no worries [17:12:18] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 6.098 second response time [17:14:00] (03CR) 10coren: [C: 032] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/244806 (owner: 10John F. Lewis) [17:14:30] * Coren headdesks. [17:14:45] "Please rebase the change locally and upload again for review." [17:15:00] Coren: no merge 4 u [17:15:03] JohnFLewis: Do you have it locally still? [17:15:19] Coren: nope [17:15:20] 6operations, 10Datasets-General-or-Unknown, 5Patch-For-Review: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1758222 (10VBaranetsky) Thanks! [17:16:08] (03PS4) 10coren: monitoring: append sms to contact groups, don't override with admins,sms [puppet] - 10https://gerrit.wikimedia.org/r/244806 (owner: 10John F. Lewis) [17:16:58] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-0/0/2: down - Transit: ! NTT (service ID 234631) {#1061} [10Gbps]BR [17:17:11] jynus: Merged. Was that our last? 
[17:17:49] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:38] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 7.315 second response time [17:20:56] bblack (sorry, was eating lunch), i saw that too, except...the systemd file isn't right [17:21:14] jynus, so what about https://gerrit.wikimedia.org/r/#/c/244237/ and https://gerrit.wikimedia.org/r/#/c/247587/ ? [17:21:16] its not updated with the change, so it think its still using the old patch.. [17:21:17] mabye. [17:21:33] at an interview, unavailable, sorry [17:21:38] Coren [17:22:05] (03PS8) 10Chad: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [17:22:07] (03PS6) 10Chad: Move dsh code into scap where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/247304 [17:22:19] Krenair: Both need more review eyes than is apropriate for a swat deploy. bblack in particular for the SSL one. [17:22:20] _joe_: Rebased ^ :) [17:23:07] 6operations, 10Dumps-Generation, 6Labs, 10wikitech.wikimedia.org: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#1758255 (10ArielGlenn) [17:23:37] (03PS2) 10Ottomata: varnishreqstats: fix instance_name corner case stuff [puppet] - 10https://gerrit.wikimedia.org/r/249161 (owner: 10BBlack) [17:23:43] (03CR) 10Ottomata: [C: 032 V: 032] varnishreqstats: fix instance_name corner case stuff [puppet] - 10https://gerrit.wikimedia.org/r/249161 (owner: 10BBlack) [17:23:48] (03Abandoned) 10Chad: Move web::sites to web::prod_sites; begin unification in new class [puppet] - 10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [17:23:59] Coren, the beta SSL change needs more eyes? [17:24:39] Krenair: Yes. bblack> jynus: if that's in puppetswat, let's hold it off. 
I haven't had time to dig deep on that, but it's not trivial what's going on there, or necessarily correct. [17:24:44] well, I guess it does change some of the tlsproxy stuff... [17:25:07] 6operations, 10ops-eqiad: Decommission sodium - https://phabricator.wikimedia.org/T110142#1758267 (10Cmjohnson) removed mgmt dns https://gerrit.wikimedia.org/r/#/c/249148/ [17:25:07] So bug Brandon. :-) [17:25:10] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:39] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1758270 (10Cmjohnson) [17:25:40] 6operations, 10ops-eqiad: label server lawrencium / wmf3542 & swap H310 for H710 controller - https://phabricator.wikimedia.org/T116776#1758268 (10Cmjohnson) 5Open>3Resolved Completed [17:26:19] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:26:48] bblack? Oct 27 17:24:30 cp1056 systemd[1]: [/lib/systemd/system/varnishreqstats-default.service:4] Unknown lvalue 'Require' in section 'Unit' [17:26:48] (03PS1) 10Rush: labvirt: setup for labvirt1011 install [puppet] - 10https://gerrit.wikimedia.org/r/249173 [17:26:59] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 6.662 second response time [17:28:24] 6.662 second response time for a 301 counts as a recovery? [17:28:56] Is it just me, or is etherpad not really working? [17:29:10] (03PS1) 10Ottomata: Add env python at top of varnishreqstats script [puppet] - 10https://gerrit.wikimedia.org/r/249174 [17:29:13] hm, maybe it sok. 
might be an old message [17:29:14] (03PS1) 10ArielGlenn: copy pagecounts-al-sites files over to labs from datasets [puppet] - 10https://gerrit.wikimedia.org/r/249175 (https://phabricator.wikimedia.org/T93317) [17:29:26] (03CR) 10Ottomata: [C: 032 V: 032] Add env python at top of varnishreqstats script [puppet] - 10https://gerrit.wikimedia.org/r/249174 (owner: 10Ottomata) [17:31:22] ottomata: yeah my bad, it's Requires= not Require= [17:31:34] ah k [17:32:01] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:39] !log ori@tin Synchronized php-1.27.0-wmf.3/resources/src/mediawiki.ui: I54c195541: Get rid of CSS transitions on form elements in mediawiki.ui (duration: 00m 17s) [17:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:48] (03PS1) 10Ottomata: Use Requires, not Require, for varnishreqstats systemd [puppet] - 10https://gerrit.wikimedia.org/r/249177 [17:35:01] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 1.142 second response time [17:35:04] (03CR) 10Ottomata: [C: 032 V: 032] Use Requires, not Require, for varnishreqstats systemd [puppet] - 10https://gerrit.wikimedia.org/r/249177 (owner: 10Ottomata) [17:35:49] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1758327 (10mark) Er, approvals? :) Also, can't this be a VM in ganeti? 
[17:36:47] (03PS4) 10Alex Monk: Begin to merge production and beta apache config, starting with nonexistent.conf [puppet] - 10https://gerrit.wikimedia.org/r/244237 (https://phabricator.wikimedia.org/T86644) [17:37:56] (03PS1) 10Ottomata: Use force => true for /usr/share/diamond/collectors/$name diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/249178 [17:37:58] (03PS1) 10Papaul: Add production DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/249179 (https://phabricator.wikimedia.org/T114712) [17:38:20] (03CR) 10Ottomata: [C: 032 V: 032] Use force => true for /usr/share/diamond/collectors/$name diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/249178 (owner: 10Ottomata) [17:40:03] !log krinkle@tin Synchronized php-1.27.0-wmf.3/extensions/NavigationTiming/modules/ext.navigationTiming.js: I95db9deefe363a65 (duration: 00m 17s) [17:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:40:12] 6operations, 5Patch-For-Review: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1758383 (10ArielGlenn) did these videos get transcoded already? or is this host still needed, I wonder? [17:40:40] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:14] ok, looking happier now bblack, thanks [17:42:16] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1758390 (10RobH) [17:42:22] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 5.190 second response time [17:42:39] (03PS1) 10Andrew Bogott: Make labvirt1011 a compute node. [puppet] - 10https://gerrit.wikimedia.org/r/249181 [17:42:43] chasemp: ^ [17:42:59] andrewbogott: cool mind if I merge? [17:43:04] (03PS2) 10Rush: Make labvirt1011 a compute node. 
[puppet] - 10https://gerrit.wikimedia.org/r/249181 (owner: 10Andrew Bogott) [17:43:14] chasemp: not at all [17:44:12] (03CR) 10Rush: [C: 032] Make labvirt1011 a compute node. [puppet] - 10https://gerrit.wikimedia.org/r/249181 (owner: 10Andrew Bogott) [17:45:05] (03PS1) 10Alex Monk: Revert "Revert "dynamicproxy: Make blocked user agents configurable"" [puppet] - 10https://gerrit.wikimedia.org/r/249182 [17:48:13] (03PS2) 10Alex Monk: Revert "Revert "dynamicproxy: Make blocked user agents configurable"" [puppet] - 10https://gerrit.wikimedia.org/r/249182 [17:49:00] 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers & statistics-users for Addshore - https://phabricator.wikimedia.org/T116784#1758419 (10Addshore) 3NEW [17:49:06] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1758428 (10ArielGlenn) No, this should not be a VM. It should be a dedicated server. [17:50:06] !log add fundraising-banner-logger hosts to icinga/nsca [17:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:36] 10Ops-Access-Requests, 6operations, 10Wikidata: Requesting access to researchers & statistics-users for Addshore - https://phabricator.wikimedia.org/T116784#1758441 (10Krenair) Both? Only researchers is required AFAIK [17:52:48] (03PS1) 10Andrew Bogott: Improved the 'buggy kernel' error message for labvirt nodes. [puppet] - 10https://gerrit.wikimedia.org/r/249184 [17:52:57] chasemp: ^ should help. [17:53:57] (03CR) 10Rush: [C: 032] Improved the 'buggy kernel' error message for labvirt nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/249184 (owner: 10Andrew Bogott) [17:54:13] kk [17:55:59] 7Puppet, 6Labs, 6Phabricator, 5Patch-For-Review: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1758510 (10Negative24) [17:56:21] !log updating jessie debian-installer to 20150422+deb8u2 [17:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:49] 6operations: reclaim calcium to spares - https://phabricator.wikimedia.org/T116790#1758512 (10RobH) 3NEW a:3RobH [17:58:23] 7Puppet, 6Labs, 6Phabricator, 5Patch-For-Review: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1758522 (10Negative24) 5Open>3Resolved [17:58:49] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: puppet fail [17:58:56] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1758524 (10RobH) [17:58:58] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1758523 (10RobH) 5Open>3Resolved [17:58:59] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1758525 (10RobH) [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151027T1800). Please do the needful. 
[18:00:28] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 1, unused: 0 [18:00:29] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0] [18:01:09] Coren: ^ [18:01:24] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1179083 (10RobH) [18:01:25] 6operations, 10hardware-requests, 5Continuous-Integration-Scaling: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1758535 (10RobH) 5Open>3Resolved I'm resolving this task, as the install of scandium is tracked on T95046. [18:01:42] chasemp: I think ariel is busy catching up on dumps atm; it's expected that the dumps server would be straining. [18:01:58] chasemp: If that lasts, I'll have a closer look for sure. [18:03:59] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:04:23] ACKNOWLEDGEMENT - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [100000000.0] Coren Dumps being caught up - giving it a day. - The acknowledgement expires at: 2015-10-29 18:03:52. 
[18:05:03] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1758560 (10RobH) [18:05:40] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.006 second response time [18:15:10] PROBLEM - Host labvirt1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:10] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1758616 (10RobH) a:5RobH>3ArielGlenn Since this will be a puppetmaster and saltmaster, I'm not sure if we want to sign the keys and have them start calling in for the base parameters or if its best t... [18:15:29] RECOVERY - Host labvirt1011 is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [18:15:34] (03CR) 10Chad: "Ran through puppet compiler, changes were minimal and expected, plus it passed. https://puppet-compiler.wmflabs.org/1093/" [puppet] - 10https://gerrit.wikimedia.org/r/247304 (owner: 10Chad) [18:15:44] (03CR) 10Alex Monk: "We can just merge instead of cherry-pick/checkout then?" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 (owner: 10Alex Monk) [18:17:09] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [18:19:09] 6operations, 10netops, 10procurement: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1758645 (10emailbot) **`Rob Halsell`** replied via email on `Tue, 27 Oct 2015 11:18:31 -0700` `Fwd: [PROBABLE SPAM] RE: Question about Wikimedia cross-connect disconnect order 1-29270862432` > Email from... [18:19:32] Uh. [18:19:33] Interesting. [18:20:24] (03CR) 10Ottomata: [C: 031] "Just saw this, looks good to me, should I merge?" [puppet] - 10https://gerrit.wikimedia.org/r/247608 (owner: 10Milimetric) [18:21:04] (03CR) 10Milimetric: "I think it's harmless - I did it at your request last week sometime." 
[puppet] - 10https://gerrit.wikimedia.org/r/247608 (owner: 10Milimetric) [18:24:49] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:24:58] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1758730 (10mobrovac) [18:25:04] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1758728 (10mobrovac) [18:25:37] Coren, sorry, puppetSWAT went for too long and it overlapped with an interview [18:25:52] jynus: s'ok. Interview takes precedence. [18:29:11] Is everything ok, I think the only things pending were 2 changes blocked by someone else? [18:33:57] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1758782 (10awight) We do want landingpages.tsv.log, it's processed and statistics about the landing page impressi... [18:34:10] (03PS1) 10RobH: setting scandium install params [puppet] - 10https://gerrit.wikimedia.org/r/249196 [18:35:13] (03PS2) 10Rush: labvirt: setup for labvirt1011 install [puppet] - 10https://gerrit.wikimedia.org/r/249173 [18:36:04] (03PS3) 10Rush: labvirt: setup for labvirt1011 install [puppet] - 10https://gerrit.wikimedia.org/r/249173 [18:37:21] aude: wikibase stays at wmf.3 this week? [18:37:53] or rather, I should branch wmf.4 from wmf.3? [18:38:20] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1758807 (10RobH) a:5ArielGlenn>3RobH It turns out this has far too much memory (it was noted in spares as 32, but instead as 96, likely from old decom hardware memory upgrades.) So, I'm reclaiming t...
[18:38:28] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1758812 (10RobH) [18:38:29] 6operations: install/setup/deploy lawrencium as eqiad salt-master - https://phabricator.wikimedia.org/T116645#1758811 (10RobH) [18:38:30] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1758809 (10RobH) 5Resolved>3Open It turns out this has far too much memory (it was noted in spares as 32, but instead as 96, likely from old decom hardware memory upgrades.) So, I'm rec... [18:40:07] (03CR) 10RobH: [C: 032] setting scandium install params [puppet] - 10https://gerrit.wikimedia.org/r/249196 (owner: 10RobH) [18:40:27] (03Abandoned) 10RobH: adding pending deployment ganglia group and setting it to default [puppet] - 10https://gerrit.wikimedia.org/r/159167 (owner: 10RobH) [18:42:08] (03CR) 10Jcrespo: "This should be ok (access is requires from dbs, terbium, and stat100X, all in 10.x), but requires coordination with analytics for deploy (" [puppet] - 10https://gerrit.wikimedia.org/r/235429 (owner: 10Muehlenhoff) [18:45:19] (03CR) 10RobH: [C: 032] Add production DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/249179 (https://phabricator.wikimedia.org/T114712) (owner: 10Papaul) [18:47:49] 6operations, 10hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1758822 (10mark) >>! In T115288#1758428, @ArielGlenn wrote: > No, this should not be a VM. It should be a dedicated server. Could you elaborate? [18:48:37] (03CR) 10Ori.livneh: burrow: Add new module for burrow (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [18:48:59] mutante: Can I beg a review of https://gerrit.wikimedia.org/r/#/c/247304/? 
[18:50:18] (03PS4) 10Ottomata: Update camus runs [puppet] - 10https://gerrit.wikimedia.org/r/249100 (https://phabricator.wikimedia.org/T113252) (owner: 10Joal) [18:50:24] ostriches: lgtm; have you tested it with the puppet compiler? [18:51:04] ori: Yep, see last comment [18:51:36] (03CR) 10Ottomata: [C: 032] Update camus runs [puppet] - 10https://gerrit.wikimedia.org/r/249100 (https://phabricator.wikimedia.org/T113252) (owner: 10Joal) [18:52:27] ostriches: fine by me, then. i am comfortable merging it, unless you explicitly want to wait for mutante's feedback. [18:52:48] Nah, I'd just been talking with him on irc about it before, if you're comfortable that's awesome thx! [18:59:49] (03PS7) 10Ori.livneh: Move dsh code into scap where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/247304 (owner: 10Chad) [18:59:59] (03CR) 10Ori.livneh: [C: 032 V: 032] Move dsh code into scap where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/247304 (owner: 10Chad) [19:01:21] PROBLEM - Host labvirt1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:02:13] ^known [19:02:45] ostriches: confirmed no-op [19:03:15] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1758864 (10RobH) [19:05:10] (03PS1) 10RobH: install/deploy scandium as zuul merger (ci) server [puppet] - 10https://gerrit.wikimedia.org/r/249205 [19:06:04] 6operations, 10Wikimedia-DNS, 7Mail: lists.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T54556#1758872 (10Nemo_bis) [19:06:10] (03CR) 10RobH: [C: 032] install/deploy scandium as zuul merger (ci) server [puppet] - 10https://gerrit.wikimedia.org/r/249205 (owner: 10RobH) [19:06:16] 6operations, 10Wikimedia-General-or-Unknown, 7Mail: DomainKeys Identified Mail (DKIM) for lists.wikimedia.org - https://phabricator.wikimedia.org/T54569#1758874 (10Nemo_bis) [19:06:21] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 2 failures 
[19:07:36] 6operations, 7Mail, 15User-greg: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1758877 (10Nemo_bis) [19:07:37] (03CR) 10Dzahn: "what about parsoid still using dsh? are they affected by this?" [puppet] - 10https://gerrit.wikimedia.org/r/247304 (owner: 10Chad) [19:07:50] ori: yippie thx! [19:08:15] (03CR) 10Chad: "The file was kept (and in the same location), just moved in the puppet repo." [puppet] - 10https://gerrit.wikimedia.org/r/247304 (owner: 10Chad) [19:08:38] 6operations, 10Wikimedia-General-or-Unknown, 7Mail: DomainKeys Identified Mail (DKIM) for wikipedia.org (and other projects) - https://phabricator.wikimedia.org/T58413#1758882 (10Nemo_bis) [19:09:05] (03PS1) 10Milimetric: Publish the new pageviews dataset to dumps [puppet] - 10https://gerrit.wikimedia.org/r/249207 [19:09:26] (03CR) 10Dzahn: "ok, cool. thanks for the patch. just got back" [puppet] - 10https://gerrit.wikimedia.org/r/247304 (owner: 10Chad) [19:10:21] (03PS1) 1020after4: 1.27.0-wmf.4 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249208 [19:10:37] (03PS1) 10Papaul: Add MAC entries for ms-be20[1-2][0-6] Bug:T114712 [puppet] - 10https://gerrit.wikimedia.org/r/249209 (https://phabricator.wikimedia.org/T114712) [19:10:55] (03CR) 1020after4: [C: 032] 1.27.0-wmf.4 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249208 (owner: 1020after4) [19:11:01] (03Merged) 10jenkins-bot: 1.27.0-wmf.4 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249208 (owner: 1020after4) [19:12:09] (03PS1) 1020after4: delete symlinks for 1.26wmf16-18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249210 [19:12:12] (03PS2) 10Ottomata: Publish the new pageviews dataset to dumps [puppet] - 10https://gerrit.wikimedia.org/r/249207 (owner: 10Milimetric) [19:12:21] (03CR) 1020after4: [C: 032] delete symlinks for 1.26wmf16-18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249210 (owner: 
1020after4) [19:12:23] (03CR) 10Ottomata: [C: 032 V: 032] Publish the new pageviews dataset to dumps [puppet] - 10https://gerrit.wikimedia.org/r/249207 (owner: 10Milimetric) [19:12:28] (03Merged) 10jenkins-bot: delete symlinks for 1.26wmf16-18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249210 (owner: 1020after4) [19:12:40] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1758928 (10RobH) [19:13:31] RECOVERY - Host labvirt1011 is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms [19:13:36] ok I'm gonna scap 1.27.0-wmf.4 [19:13:44] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1758937 (10RobH) a:5RobH>3hashar I think the service implementation of this would belong to @hashar, so I am assigning this task to him for the final steps. If this i... [19:13:59] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1758939 (10RobH) 5stalled>3Open [19:15:06] !log twentyafterfour@tin Started scap: sync everything for 1.27.0-wmf.4 and point testwiki to the new branch [19:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:29] (03CR) 10Ottomata: [C: 032 V: 032] Initial debian packaging [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/248344 (https://phabricator.wikimedia.org/T116084) (owner: 10Ottomata) [19:15:31] 6operations, 7Mail, 15User-greg: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1758967 (10Nemo_bis) I filed the standard DKIM/SPF tasks we (eventually) do for all domains, but I'm not going to file the defects in Phabricator itself. >>! In... 
[19:17:19] (03PS1) 10Hashar: contint: install rake on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/249213 [19:17:32] PROBLEM - puppet last run on labvirt1011 is CRITICAL: Connection refused by host [19:17:32] PROBLEM - RAID on labvirt1011 is CRITICAL: Connection refused by host [19:17:41] PROBLEM - DPKG on labvirt1011 is CRITICAL: Connection refused by host [19:17:51] PROBLEM - salt-minion processes on labvirt1011 is CRITICAL: Connection refused by host [19:17:52] PROBLEM - Disk space on labvirt1011 is CRITICAL: Connection refused by host [19:18:02] PROBLEM - SSH on labvirt1011 is CRITICAL: Connection refused [19:18:33] PROBLEM - dhclient process on labvirt1011 is CRITICAL: Connection refused by host [19:18:48] (03CR) 10Hashar: "That follow up a conversation we had on the JJB patch that created 'rake-jessie'." [puppet] - 10https://gerrit.wikimedia.org/r/249213 (owner: 10Hashar) [19:18:52] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: Connection refused by host [19:18:52] PROBLEM - configured eth on labvirt1011 is CRITICAL: Connection refused by host [19:21:31] uh [19:21:34] oh, 1011 [19:21:37] new stuff! alright [19:21:42] (03CR) 10Ottomata: burrow: Add new module for burrow (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [19:22:39] (03PS2) 10Hashar: contint: install rake on Nodepool instances [puppet] - 10https://gerrit.wikimedia.org/r/249213 [19:25:23] (03CR) 10Hashar: [C: 031 V: 032] "Bah 'rake' was already in contint::packages but that one is not included on Nodepool instances." 
[puppet] - 10https://gerrit.wikimedia.org/r/249213 (owner: 10Hashar) [19:28:25] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1759040 (10awight) [19:29:51] (03CR) 10Hashar: "Related CI change is https://gerrit.wikimedia.org/r/#/c/249219/" [puppet] - 10https://gerrit.wikimedia.org/r/249213 (owner: 10Hashar) [19:30:01] PROBLEM - Apache HTTP on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.094 second response time [19:30:17] (03PS1) 10Gergő Tisza: Make nutcracker's auto_eject_hosts setting configurable [puppet] - 10https://gerrit.wikimedia.org/r/249222 (https://phabricator.wikimedia.org/T109173) [19:31:03] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:32:34] PROBLEM - HHVM rendering on mw2187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 689 bytes in 0.096 second response time [19:35:53] "Fatal exception of type MWException" on that box [19:37:13] "No localisation cache found for English" [19:38:01] anyone deploying custom code there? [19:38:44] If not, run sync-common [19:39:01] doubt anyone is messing there, unless w tells you someone's on it [19:39:02] what hoo said ^ [19:39:50] it is codfw, it can wait for the train [19:40:24] It can...
but if it spams the error logs, fix it [19:40:31] it will still get hit by pybal [19:40:39] okeeeey [19:41:55] It's the top thing in logstash, heh [19:42:05] Er, fatalmonitor dashboard in logstash, but ya [19:42:13] PROBLEM - Host labvirt1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:42:59] 6operations, 10RESTBase-Cassandra: PID not expanded in heap dumps - https://phabricator.wikimedia.org/T116814#1759083 (10Eevans) 3NEW [19:43:53] (03CR) 10Dzahn: Add monitoring for the kvm ssl cert, labvirt-star (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249147 (https://phabricator.wikimedia.org/T116332) (owner: 10Andrew Bogott) [19:44:23] !log Ran sync-common on mw2187 to rebuild l10n caches [19:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:41] jynus, hoo, ostriches: ^ [19:44:44] RECOVERY - Apache HTTP on mw2187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.174 second response time [19:44:50] Nice :) [19:44:54] Thanks [19:45:03] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 63784 bytes in 2.275 second response time [19:45:30] hey, I am not a deployer, I needed time [19:45:39] :) [19:46:03] deploying also isn't about being a deployer [19:46:06] * developer [19:46:11] although it can be [19:46:20] so easy a manager can do it guys ;) [19:46:27] what? [19:46:31] but what's the underlying cause of this error? [19:46:40] !log twentyafterfour@tin Finished scap: sync everything for 1.27.0-wmf.4 and point testwiki to the new branch (duration: 31m 32s) [19:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:51] (03CR) 10Dzahn: "ok, $ARG1 is the name of the cert and the script gets it as $1, but doesn't it somewhere need to know the full path to the script? or wher" [puppet] - 10https://gerrit.wikimedia.org/r/249147 (https://phabricator.wikimedia.org/T116332) (owner: 10Andrew Bogott) [19:47:09] MaxSem: good question. 
Possibly a hiccup during the scap? [19:47:46] (03CR) 10Dzahn: "[edit] full path to the _cert_ that it checks" [puppet] - 10https://gerrit.wikimedia.org/r/249147 (https://phabricator.wikimedia.org/T116332) (owner: 10Andrew Bogott) [19:47:49] Here are the changes that mw2187 saw -- https://phabricator.wikimedia.org/P2244 [19:48:02] uhm why does https://test.wikipedia.org/wiki/Special:Version say wmf.3 at the top and wmf.4 below? Did I miss something? [19:49:01] bd808: no errors reported by scap [19:49:16] twentyafterfour: looks like a caching thing to me (the mixed versions on that page) [19:49:35] I don't know what invalidates the banner [19:49:39] twentyafterfour: fatals for me [19:49:45] Well, worked second time [19:50:33] bd808, commented out in https://github.com/wikimedia/operations-puppet/blob/production/modules/scap/files/dsh/group/mediawiki-installation [19:50:35] WTF [19:51:06] ah. depooled by still being hit by pybal or something? [19:51:13] s/by/but/ [19:51:13] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:25] (03CR) 10Reedy: "Why it was enabled:" [puppet] - 10https://gerrit.wikimedia.org/r/249102 (owner: 10Giuseppe Lavagetto) [19:52:54] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 63783 bytes in 0.535 second response time [19:52:57] (03CR) 1020after4: "probably not needed for read-only git" [puppet] - 10https://gerrit.wikimedia.org/r/249102 (owner: 10Giuseppe Lavagetto) [19:53:01] apparently not the first time mw2187 has gotten the same "fix" -- https://tools.wmflabs.org/sal/production?p=0&q=mw2187 [19:54:48] https://phabricator.wikimedia.org/T109717 [19:54:53] reopen? 
[19:54:55] if $::site == 'eqiad' { [19:54:55] monitoring::service { 'mediawiki-installation DSH group': [19:55:00] * MaxSem bites _joe_ [19:55:45] I introduced this check precisely to avoid this shit happening again and again and again [19:56:03] (03PS1) 10Ori.livneh: Harden grafana setup [puppet] - 10https://gerrit.wikimedia.org/r/249229 [19:56:31] jynus: well if it is commented out in the dsh control file that would explain things I think. Which is what MaxSem is getting ally bitey about [19:56:47] s/y// [19:56:52] (03PS2) 10Ori.livneh: Harden grafana setup [puppet] - 10https://gerrit.wikimedia.org/r/249229 [19:57:01] (03CR) 10Ori.livneh: [C: 032 V: 032] Harden grafana setup [puppet] - 10https://gerrit.wikimedia.org/r/249229 (owner: 10Ori.livneh) [19:57:15] too many repos and places to keep track of [19:57:19] I once seen a host in eqiad serving prod traffic with MW 4 months out of date [19:57:52] nostalgia.wikipedia.org [19:57:53] fundamentally the dsh file generated by puppet and pybal need to agree (which is why we should move to just talking to pybal/etcd directly from scap) [19:57:56] :P [19:59:49] MaxSem: at least puppet runs sync-common now so things shouldn't get more out of date than the puppet run frequency but that's still too much [20:00:34] RECOVERY - Host labvirt1011 is UP: PING OK - Packet loss = 0%, RTA = 1.93 ms [20:03:09] mw1017 was testwiki? [20:03:15] yup [20:03:42] I'm all for having scap talk to etcd directly, it just never has been clear to me how we would do that [20:03:51] I have a bug filed. [20:03:57] And I saw an etcd library or 2 [20:04:01] For python [20:04:08] ok, going to lunch, I will leave fighting the new bugs there [20:04:18] Joe wrote/maintains one as I recall [20:04:21] yes, that would be cool to kill dsh with that [20:04:52] The project to move pybal to using etcd is still underway isn't it? [20:05:08] yea, i just know there is the idea [20:05:31] mutante, we kinda have salt as a dsh replacement. 
the problem is that it's root only [20:05:41] eh, the idea to let scap talk to etcd i meant there [20:06:12] scap doesn't use dsh it just uses the dsh host files [20:06:31] MaxSem: yes, agree that would also be nice if the salt issue was fixed [20:06:33] nothing actually uses dsh anymore since we killed key forwarding [20:06:51] but the legacy of the control files remains for scap [20:07:14] bd808: deleting mediawiki branches is something I do manually via dsh, occasionally [20:07:15] well there is some hack to use dsh for parsoid restarts too I think [20:07:22] err, what's wrong with mw2208? it's commented out in dsh and I can't log into it [20:07:26] we could start deleting stuff from modules/dsh [20:07:29] also, empty in icinga [20:07:30] except that one file{} [20:07:39] or 2 [20:07:44] I think that patch is proposed isn't it? [20:08:22] yea, well earlier a patch by ostriches has been merged [20:09:18] It's all gone from modules/dsh/ now [20:09:29] The little bit of spaghetti that's left is in modules/scap/ now [20:10:08] :) [20:12:16] (03PS1) 10MaxSem: Reenable DSH checks for codfw, uncomment hosts that are up [puppet] - 10https://gerrit.wikimedia.org/r/249234 [20:12:17] here we go ^^^ [20:13:02] (03PS2) 10Andrew Bogott: Add monitoring for the kvm ssl cert, labvirt-star [puppet] - 10https://gerrit.wikimedia.org/r/249147 (https://phabricator.wikimedia.org/T116332) [20:13:03] akosiaris, how can i access OSM db built for labs? It seems the passwords have changed.
[20:16:43] (03CR) 10Dzahn: Reenable DSH checks for codfw, uncomment hosts that are up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249234 (owner: 10MaxSem) [20:18:23] (03CR) 10MaxSem: Reenable DSH checks for codfw, uncomment hosts that are up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/249234 (owner: 10MaxSem) [20:19:38] (03CR) 10GWicke: [C: 031] cassandra: updated gc settings [puppet] - 10https://gerrit.wikimedia.org/r/248960 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [20:20:08] (03PS2) 10Dzahn: Reenable DSH checks for codfw, uncomment hosts that are up [puppet] - 10https://gerrit.wikimedia.org/r/249234 (owner: 10MaxSem) [20:21:07] (03CR) 10Dzahn: [C: 032] Reenable DSH checks for codfw, uncomment hosts that are up [puppet] - 10https://gerrit.wikimedia.org/r/249234 (owner: 10MaxSem) [20:22:53] (03PS3) 10Dzahn: Add monitoring for the kvm ssl cert, labvirt-star [puppet] - 10https://gerrit.wikimedia.org/r/249147 (https://phabricator.wikimedia.org/T116332) (owner: 10Andrew Bogott) [20:23:42] hello :-) [20:23:42] I could use 'rake' on the CI slaves, forgot to get it installed on Nodepool instances: https://gerrit.wikimedia.org/r/#/c/249213/ :-} [20:23:58] applied on labs already [20:24:29] (03CR) 10Dzahn: [C: 032] Add monitoring for the kvm ssl cert, labvirt-star [puppet] - 10https://gerrit.wikimedia.org/r/249147 (https://phabricator.wikimedia.org/T116332) (owner: 10Andrew Bogott) [20:26:41] (03PS3) 10Dzahn: contint: install rake on Nodepool instances [puppet] - 10https://gerrit.wikimedia.org/r/249213 (owner: 10Hashar) [20:26:53] (03PS1) 10BBlack: reqstats: test on cp1065 (eqiad text) as well [puppet] - 10https://gerrit.wikimedia.org/r/249237 [20:26:56] mutante: you are a hero :-} [20:27:11] (03CR) 10Dzahn: [C: 032] contint: install rake on Nodepool instances [puppet] - 10https://gerrit.wikimedia.org/r/249213 (owner: 10Hashar) [20:29:22] andrewbogott: MaxSem: both of yours should appear in icinga soon.. neon is running. 
i already saw it add the checks right now
[20:29:29] (CR) Hashar: "Danke mutante. Will refresh the Nodepool instances" [puppet] - https://gerrit.wikimedia.org/r/249213 (owner: Hashar)
[20:29:38] hashar: labnodepool1001.eqiad ?
[20:29:42] thanks mutante!
[20:29:44] operations, Deployment-Systems: Investigate whether mod_dav needs to stay enable on tin/terbium - https://phabricator.wikimedia.org/T116823#1759278 (Reedy) NEW
[20:29:46] mutante: na unrelated
[20:29:57] (PS3) Yuvipanda: Revert "Revert "dynamicproxy: Make blocked user agents configurable"" [puppet] - https://gerrit.wikimedia.org/r/249182 (owner: Alex Monk)
[20:29:59] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1759285 (BBlack) >>! In T102566#1732320, @Tgr wrote: > Yes, a couple hours ago. We should write to mediawiki-announce, wait...
[20:30:10] (CR) Yuvipanda: [C: 2 V: 2] Revert "Revert "dynamicproxy: Make blocked user agents configurable"" [puppet] - https://gerrit.wikimedia.org/r/249182 (owner: Alex Monk)
[20:30:10] mutante: the manifest is used for a script that builds a reference image to boot instances out of it. labnodepool1001.eqiad.wmnet is unchanged :-}
[20:30:25] operations, Deployment-Systems: Investigate whether mod_dav needs to stay enabled on tin/terbium - https://phabricator.wikimedia.org/T116823#1759287 (Reedy)
[20:30:32] hashar: ok, i was checking there if it installed rake. all ok then
[20:32:31] mutante: thank you :-}
[20:32:39] MaxSem: works. status PENDING on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=dsh+
[20:32:53] =)
[20:32:58] except "nobelium" (?)
[20:33:25] since 4 days though
[20:34:01] '10.64.37.14/32', # nobelium, temporary mw install to copy over es indices
[20:34:23] operations, ops-eqiad, netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1759293 (BBlack) I removed the interface-level `disable` from all of these ports (xe-8/0/23-28), so testing them further should just be a matter of plugging in more SFPs and cables.
[20:34:40] aha.. ok
[20:35:01] i guess that's an ACK for not being in dsh group
[20:35:24] or is it
[20:35:58] in terms of updates to nobelium, it doesn't need to receive code updates in any way
[20:36:17] the only reason we brought the code over there is because terbium runs php5, which runs the elasticsearch copy process at 50-75% of a cpu per process
[20:36:27] running under hhvm brings it to 3-5% and we can run *much* more at the same time
[20:36:55] basically, when running on terbium we had 30 machines and 60 SSD's on one side, and 24 machines and 48 SSD's on the other side, all bottlenecked by terbium running php 5.3
[20:37:05] ACKNOWLEDGEMENT - mediawiki-installation DSH group on nobelium is CRITICAL: Host nobelium is not in mediawiki-installation dsh group daniel_zahn temporary mw install to copy over es indices
[20:37:27] ok, then just this. thanks for explaining
[20:39:25] the only issue with this is.. even more checks on neon which is so busy
[20:39:45] we need to replace neon
[20:40:02] if its easier, we can send code updates there
[20:40:06] just saying its not necessary
[20:40:13] oh that'll be a fun thing mutante :)
[20:40:23] ebernhardson: is the codfw copy also running on nobelium?
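A rough back-of-the-envelope for the php5 vs hhvm figures quoted above. This is only a sketch: it assumes the copy processes are purely CPU-bound and scale linearly, which the log does not state, and the function name is made up for illustration.

```python
def processes_per_core(cpu_fraction_per_process: float) -> float:
    """Return how many copy processes fit on one fully used core."""
    if cpu_fraction_per_process <= 0:
        raise ValueError("CPU fraction must be positive")
    return 1.0 / cpu_fraction_per_process


# Per-process CPU usage as quoted in the channel (fractions of one core).
QUOTED_USAGE = {
    "php5 on terbium": (0.50, 0.75),
    "hhvm on nobelium": (0.03, 0.05),
}

for runtime, (low, high) in QUOTED_USAGE.items():
    # Higher usage per process means fewer processes per core.
    fewest = processes_per_core(high)
    most = processes_per_core(low)
    print(f"{runtime}: about {fewest:.1f} to {most:.1f} processes per core")
```

Under those assumptions php5 manages roughly 1.3 to 2 copy processes per core and hhvm roughly 20 to 33, which fits the "*much* more at the same time" remark.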
[20:40:45] YuviPanda: yes, the initial copy finished but based on turning on the logging and looking at things it seems a couple got missed, so running it again
[20:40:56] ah ok
[20:40:58] err, turning on the live updates from eqiad->codfw yesterday
[20:41:03] didn't realize that was going on too
[20:41:08] but nice to not tax terbium
[20:41:41] andrewbogott: so, by putting that kvm ssl cert check into the role, we have it on each labvirt server, not just once. is that as designed or actually more than needed
[20:41:56] it’s more than needed but should be harmless
[20:42:11] I couldn’t think of a graceful way to limit it to one host
[20:42:20] YuviPanda: the funny thing is even with 40 processes copying eqiad->codfw, and 12 processes copying eqiad->labsearch, its still only holding a load average of 8 :)
[20:42:27] operations, Labs, Tool-Labs: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1759311 (yuvipanda) NEW
[20:42:31] andrewbogott: ok, so the first of them just got the first status, it is UNKNOWN (3)
[20:42:39] ebernhardson: nice :D is a pretty hunky machine
[20:42:45] mutante: well, that surprises me
[20:42:52] mutante: I can debug in a bit, if you don’t want to
[20:43:27] mutante: hmm wondering where I should place the check on https://phabricator.wikimedia.org/T116825?workflow=create
[20:43:42] mutante, running these checks once an hour is fine
[20:44:18] I think that's what it tries to do
[20:45:07] andrewbogott: re: multiple checks, i agree it's harmless or maybe even better to check each, the only downside is neon resources. for some of the ones that are on multiple hosts i put them on a virtual host i add with @monitoring::host
[20:45:35] andrewbogott: looking at the plugin now
[20:45:43] thanks
[20:46:57] maybe I should make a small class in tools and include that...
somewhere
[20:47:01] labcontrol maybe
[20:47:07] MaxSem: yes, normal_check_interval 60 confirmed
[20:47:26] (other checks have "1")
[20:48:35] YuviPanda: @monitoring::host { 'tools.wmflabs.org':
[20:48:44] oh that's a thing?
[20:48:46] host_fqdn => 'tools.wmflabs.org'
[20:48:47] interesting
[20:48:48] }
[20:48:55] and then you use that host in an icinga service
[20:48:58] as if it was real
[20:49:40] you can take a look at modules/icinga/manifests/monitor/certs.pp
[20:49:56] that's what i did to check certs on external servers
[20:49:57] like blog
[20:50:55] mutante: nice!
[20:50:57] looking
[20:51:25] mutante: so should this be toollabs::monitoring::icinga?
[20:51:30] or icinga::monitor::toollabs?
[20:51:32] hmm
[20:51:35] I guess the former
[20:52:38] yea.. i'm not really sure. i could see some small advantage in both i guess :p
[20:52:47] RECOVERY - SSH on labvirt1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[20:52:58] just used icinga/manifests/monitor/ because we had other stuff there
[20:53:27] but if you put it _not_ into icinga itself
[20:53:28] yeah
[20:53:42] but into a module, then folks will say "put it in the role"
[20:53:46] maybe :)
[20:54:52] (Abandoned) Muehlenhoff: Mark calcium as testsystem [puppet] - https://gerrit.wikimedia.org/r/247523 (owner: Muehlenhoff)
[20:55:52]
[20:56:19] mutante: yeah but all the toollabs roles are included in labs instances
[20:56:24] this will be included in a prod instance
[20:56:26] so...
[20:56:45] YuviPanda: toollabs:not-labs::monitoring :D
[20:57:33] or tool-not-labs::monitoring [underscores not hyphen too!]
[20:59:09] operations, ops-eqiad, Labs, Labs-Infrastructure, labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1759380 (chasemp) andrew tracked down https://wikitech.wikimedia.org/wiki/HP_DL360Gen9#Embedded_user_partition Which seems like: http://www8.hp.com/...
[20:59:22] (PS3) MaxSem: [WIP] Switch www portals to be deployed from Git [mediawiki-config] - https://gerrit.wikimedia.org/r/248526 (https://phabricator.wikimedia.org/T115964)
[21:01:27] (PS1) Andrew Bogott: Modify labvirt-ssd to more closely resemble db.cfg [puppet] - https://gerrit.wikimedia.org/r/249291
[21:08:19] (CR) Andrew Bogott: [C: 2] Modify labvirt-ssd to more closely resemble db.cfg [puppet] - https://gerrit.wikimedia.org/r/249291 (owner: Andrew Bogott)
[21:09:14] YuviPanda: got disconnected. back. so i ended up with
[21:09:15] (PS1) Yuvipanda: tools: Make main page check paging with icinga [puppet] - https://gerrit.wikimedia.org/r/249292 (https://phabricator.wikimedia.org/T116925)
[21:10:00] mutante: yeah, last message was 'maybe :)'
[21:10:50] YuviPanda: "in the role class where it makes sense, and if it's a global thing, then in icinga itself" but i see your special case with labs and prod
[21:11:12] (PS1) Ori.livneh: xenon: sort files entries in descending date order [puppet] - https://gerrit.wikimedia.org/r/249293
[21:11:17] operations, Labs, Tool-Labs: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1759445 (yuvipanda) >>! In T113979#1759389, @scfc wrote: > I'd prefer if we don't combine human users and service groups in one group, but create another (`project-tools...
[21:11:24] (CR) Ori.livneh: [C: 2 V: 2] xenon: sort files entries in descending date order [puppet] - https://gerrit.wikimedia.org/r/249293 (owner: Ori.livneh)
[21:14:53] mutante: what do you think of https://gerrit.wikimedia.org/r/#/c/249292/
[21:14:58] operations, hardware-requests: Allocate hardware for salt master in eqiad - https://phabricator.wikimedia.org/T115288#1759460 (RobH) My bad on skipping @mark for approvals, I was under the impression that this was roadmapped, expected, and discussed during our recent ops meeting. So, pending his sign off...
[21:16:14] (PS1) Andrew Bogott: Typo fix in labvirt-ssd.cfg [puppet] - https://gerrit.wikimedia.org/r/249294
[21:17:32] (CR) Andrew Bogott: [C: 2] Typo fix in labvirt-ssd.cfg [puppet] - https://gerrit.wikimedia.org/r/249294 (owner: Andrew Bogott)
[21:17:48] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail
[21:20:34] (CR) Dzahn: "using @monitoring::host { 'tools.wmflabs.org' is good, that should just work." [puppet] - https://gerrit.wikimedia.org/r/249292 (https://phabricator.wikimedia.org/T116925) (owner: Yuvipanda)
[21:21:04] YuviPanda: it looks good, i'd just do it without actual paging first and then make the switch later
[21:21:15] mutante: good catch
[21:23:24] (PS2) Yuvipanda: tools: Make main page check in icinga [puppet] - https://gerrit.wikimedia.org/r/249292 (https://phabricator.wikimedia.org/T116925)
[21:25:17] (PS3) Yuvipanda: tools: Make main page check in icinga [puppet] - https://gerrit.wikimedia.org/r/249292 (https://phabricator.wikimedia.org/T116925)
[21:25:26] (CR) Yuvipanda: [C: 2 V: 2] tools: Make main page check in icinga [puppet] - https://gerrit.wikimedia.org/r/249292 (https://phabricator.wikimedia.org/T116925) (owner: Yuvipanda)
[21:26:52] (PS1) 20after4: group0 wikis to 1.27.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/249298
[21:27:24] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1759513 (RobH)
[21:29:00] (CR) 20after4: [C: 2] group0 wikis to 1.27.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/249298 (owner: 20after4)
[21:29:07] (Merged) jenkins-bot: group0 wikis to 1.27.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/249298 (owner: 20after4)
[21:29:28] !log ori@tin Synchronized php-1.27.0-wmf.3/includes/debug/logger/LoggerFactory.php: I437bcb532: LoggerFactory: Only check for Psr\Log\LoggerInterface once (duration: 00m 18s)
[21:29:32] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:29:49] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.4
[21:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:32:16] mutante: nice! https://icinga.wikimedia.org/icinga/ is at PENDING now
[21:32:23] stupid icinga
[21:32:40] "Notice: Undefined index: timeout in /srv/mediawiki/php-1.27.0-wmf.4/includes/objectcache/MemcachedPeclBagOStuff.php on line 86"
[21:32:46] <_joe_> ostriches: a python library for etcd?
[21:32:51] <_joe_> ostriches: what for? :)
[21:33:10] <_joe_> bd808: yeah the etcd code for pybal is merged, pending deployment
[21:33:20] <_joe_> but terbium and tin have precedence atm
[21:33:32] <_joe_> so... maybe next week
[21:33:45] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1759555 (RobH)
[21:33:46] operations, ops-eqiad, Patch-For-Review: mw1083's sda disk is dying - https://phabricator.wikimedia.org/T116184#1759556 (RobH)
[21:34:26] <_joe_> ostriches: the dsh file can be autogenerated from etcd...
with a simple daemon we already run in prod
[21:34:26] operations, ops-eqiad, Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1759558 (RobH)
[21:34:27] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1744570 (RobH)
[21:34:31] <_joe_> I must work on that too
[21:34:56] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1744570 (RobH)
[21:35:12] _joe_: Even easier then :)
[21:35:39] <_joe_> ostriches: but yeah, _the_ library to use in python is python-etcd :)
[21:35:52] yep, that was the one I was looking at
[21:35:57] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1759595 (RobH)
[21:35:59] operations, ops-eqiad, Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1759591 (RobH) Resolved>Open We actually need to add three more video scalers per T114337. Since this is still depooled, I'll assign this system mw1061 to that...
[21:36:16] operations, TimedMediaHandler, hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1691895 (RobH) mw1061 is allocated to this task. (2 more to go)
[21:36:54] <_joe_> robh: I think you should reinstall those systems, or keep a close eye when changing puppet role
[21:37:34] mutante: ok, it's green now.
I'll make it paging now
[21:38:40] operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1759601 (RobH)
[21:40:28] (PS1) Yuvipanda: tools: Make home page check critical [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116925)
[21:40:32] (CR) jenkins-bot: [V: -1] tools: Make home page check critical [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116925) (owner: Yuvipanda)
[21:40:46] (PS2) Yuvipanda: tools: Make home page check critical [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116925)
[21:42:24] YuviPanda: ok. and since we both used the same virtual host, your HTTP and NFS checks are grouped with the HTTPS/cert check i added. and we monitor both protocols https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=tools.wmflabs.org&nostatusheader
[21:42:36] mutante: yup, is quite nice
[21:42:40] where the cert expiry shouldn't be paging
[21:42:48] yeah
[21:42:49] unless it already happened or something :p
[21:42:51] heh
[21:43:01] (CR) coren: [C: 1] "Yep. We want the noise." [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116925) (owner: Yuvipanda)
[21:43:01] awww fuck
[21:43:07] the cat just puked on my bed
[21:43:18] :o
[21:43:19] _joe_: reinstall sounds safer!
[21:43:28] i did not expect that second line after "fuck"
[21:43:32] so i think that its worth the 10m of unattended reinstall.
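A minimal sketch of _joe_'s earlier point that the dsh file could be generated from etcd. Everything here is an assumption for illustration: the host-to-state mapping stands in for whatever would actually be read from etcd (for example via python-etcd), and the one-host-per-line layout with `#` for commented-out hosts is inferred from how the channel talks about hosts being "commented out in dsh". `render_dsh_group` is a hypothetical helper, not part of scap or any existing daemon.

```python
from typing import Mapping


def render_dsh_group(hosts: Mapping[str, bool]) -> str:
    """Render a dsh-style group file from a host -> is_active mapping.

    Active hosts are listed one per line; inactive hosts are kept in
    the file but commented out with a leading '#'.
    """
    lines = []
    for name in sorted(hosts):
        lines.append(name if hosts[name] else f"#{name}")
    # One trailing newline per entry; an empty mapping yields an empty file.
    return "".join(line + "\n" for line in lines)


# Hypothetical state; the hostnames are taken from the log, the values are made up.
example_state = {"mw2050": True, "mw2208": False, "mw2128": True}
print(render_dsh_group(example_state), end="")
```

Sorting the hostnames keeps the generated file stable between runs, so a regenerated file only differs when the underlying state does.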
[21:43:33] but something more technical :[p
[21:44:31] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[21:44:43] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1759636 (Tgr) [[ https://lists.wikimedia.org/pipermail/mediawiki-announce/2015-October/000183.html | Done. ]]
[21:48:31] ori: ?
[21:50:07] !log catrope@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/: Fix for cache key bug (duration: 00m 20s)
[21:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:50:20] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25)
[21:50:47] !log aaron@tin Synchronized php-1.27.0-wmf.4/includes/MovePage.php: 7a7c7b27d6c (duration: 00m 17s)
[21:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:52:22] mutante: a critical for fermium which seems legit - I'm surprised
[21:53:04] matanya: ?
[21:54:01] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[21:54:08] hi ori he.wikipedia is facing a high profile bug in google affecting appearance of articles, do we have friends at google to help use with that ?
[21:54:33] basically the bug is articles with ' in them don't show in google search
[21:54:56] (the hebrew tag גרש not the english apostrophe)
[21:55:16] *help us
[21:56:12] https://phabricator.wikimedia.org/T112425
[21:56:27] matanya: pretty sure that affects all wikis, and no one seems to care much, alas
[21:56:37] matanya: not really, sadly. not anywhere near search. we've reached out to them before with very innocent questions ("would outputting the table of contents collapse / expand element in the page html show up in search snippets"?)
and they don't answer
[21:56:44] indeed that, MatmaRex thanks
[21:56:59] they're insanely hyper careful not to appear to be giving insider information to anyone
[21:57:25] oh well, thanks ori
[21:57:28] the best contact is probably denny vrandečić
[21:57:46] he's a wmf trustee and he works for google
[21:58:02] yeah, i know who he is
[21:58:39] operations, ops-codfw, Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1759718 (RobH) I've committed all the needed changes for the switch port description, enable, and vlan for ms-be2016 through ms-be2021
[22:03:23] operations, Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1759738 (RobH) NEW
[22:03:30] RECOVERY - RAID on labvirt1011 is OK: OK: no RAID installed
[22:03:36] operations, ops-codfw, Swift: rack & initial on-site setup of ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1703678 (RobH)
[22:03:37] operations, Swift: install/setup/deploy ms-be2016-2021 - https://phabricator.wikimedia.org/T116842#1759747 (RobH)
[22:03:40] (CR) Rush: [C: 1] "cool, may want to let -ops know since this is new" [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116925) (owner: Yuvipanda)
[22:03:51] RECOVERY - Disk space on labvirt1011 is OK: DISK OK
[22:04:12] RECOVERY - configured eth on labvirt1011 is OK: OK - interfaces up
[22:04:40] RECOVERY - DPKG on labvirt1011 is OK: All packages OK
[22:04:41] RECOVERY - dhclient process on labvirt1011 is OK: PROCS OK: 0 processes with command name dhclient
[22:04:52] (CR) Andrew Bogott: [C: 1] tools: Make home page check critical [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116925) (owner: Yuvipanda)
[22:05:11] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[22:06:50] RECOVERY - nova-compute process on
labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[22:06:52] chasemp: good call, I've emailed
[22:07:01] chasemp: I'll merge it tomorrow
[22:10:13] (PS1) Legoktm: Set $wgRCWatchCategoryMembership = true on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/249306
[22:12:10] (PS2) Dzahn: Add MAC entries for ms-be20[1-2][0-6] Bug:T114712 [puppet] - https://gerrit.wikimedia.org/r/249209 (https://phabricator.wikimedia.org/T114712) (owner: Papaul)
[22:13:26] operations, ops-eqiad, Labs, Labs-Infrastructure, labs-sprint-118: Rack/Setpup labvirt1010 and 1011 - https://phabricator.wikimedia.org/T116019#1759822 (chasemp) Open>Resolved Great all set. I did the puppet stuff and booted one vm manually on each to verify. Seems solid. great team eff...
[22:13:51] (Abandoned) Rush: labvirt: setup for labvirt1011 install [puppet] - https://gerrit.wikimedia.org/r/249173 (owner: Rush)
[22:13:55] (CR) Dzahn: [C: 2] "checked 2 random MACs on the mgmt consoles too.
There are actually 6 ports, and these are port 5 and that is exactly what papaul said it w" [puppet] - https://gerrit.wikimedia.org/r/249209 (https://phabricator.wikimedia.org/T114712) (owner: Papaul)
[22:23:35] (CR) Dzahn: [C: 1] "+ 1 per Brandon's reasoning and also per Mozilla" [puppet] - https://gerrit.wikimedia.org/r/249017 (owner: BBlack)
[22:24:54] (CR) Greg Grossmeier: [C: 1] beta: Add enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/248639 (owner: Alex Monk)
[22:28:44] operations, Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1759899 (IKhitron) Resolved>Open
[22:29:18] operations, Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1674279 (IKhitron) Same problem at twice-a-month-queries. See "Special:DeadendPages" for example.
[22:37:05] operations, Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1759934 (Krenair) ```krenair@terbium:/var/log/mediawiki/updateSpecialPages$ cat *DeadendPages.log The specified dblist file, /sr...
[22:38:32] mutante, hey
[22:39:28] operations, Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1759959 (IKhitron) Still a problem :-)
[22:39:41] !log powercycling unresponsive analytics1039, here's what i saw on mgmt https://phabricator.wikimedia.org/P2248
[22:39:42] (PS1) Alex Monk: Fix mwscriptwikiset dblists paths [puppet] - https://gerrit.wikimedia.org/r/249310 (https://phabricator.wikimedia.org/T113721)
[22:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:40:12] Krenair: yep
[22:40:42] mutante, please see that commit I just uploaded
[22:41:11] BUG: soft lockup - CPU#23 stuck for 23s
[22:41:51] ooh, that was broken for a while then?
[22:41:53] reads ticket
[22:42:16] RECOVERY - Check size of conntrack table on analytics1039 is OK: OK: nf_conntrack is 0 % full
[22:42:26] RECOVERY - SSH on analytics1039 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[22:42:26] RECOVERY - dhclient process on analytics1039 is OK: PROCS OK: 0 processes with command name dhclient
[22:42:27] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[22:42:28] yes, since the dblist changes probably
[22:42:40] operations: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#1759986 (RobH)
[22:42:46] RECOVERY - configured eth on analytics1039 is OK: OK - interfaces up
[22:42:47] RECOVERY - DPKG on analytics1039 is OK: All packages OK
[22:42:56] RECOVERY - RAID on analytics1039 is OK: OK: optimal, 13 logical, 14 physical
[22:43:07] RECOVERY - Disk space on analytics1039 is OK: DISK OK
[22:43:17] RECOVERY - salt-minion processes on analytics1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:43:28] RECOVERY - Hadoop NodeManager on analytics1039 is
OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[22:43:28] RECOVERY - Disk space on Hadoop worker on analytics1039 is OK: DISK OK
[22:43:48] RECOVERY - Hadoop DataNode on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[22:44:05] (CR) BryanDavis: [C: 1] Fix mwscriptwikiset dblists paths [puppet] - https://gerrit.wikimedia.org/r/249310 (https://phabricator.wikimedia.org/T113721) (owner: Alex Monk)
[22:44:17] (CR) Dzahn: [C: 2] "confirmed on terbium the path is:" [puppet] - https://gerrit.wikimedia.org/r/249310 (https://phabricator.wikimedia.org/T113721) (owner: Alex Monk)
[22:44:22] (PS2) Dzahn: Fix mwscriptwikiset dblists paths [puppet] - https://gerrit.wikimedia.org/r/249310 (https://phabricator.wikimedia.org/T113721) (owner: Alex Monk)
[22:45:26] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:45:48] RECOVERY - NTP on analytics1039 is OK: NTP OK: Offset -0.001217961311 secs
[22:47:17] Krenair: on terbium: +if [ ! -f $MEDIAWIKI_DEPLOYMENT_DIR/dblists/$LISTFILE ]; then
[22:47:31] thanks
[22:48:57] (CR) Jdlrobson: [C: 1] "does this include a mobile equivalent?" [mediawiki-config] - https://gerrit.wikimedia.org/r/248639 (owner: Alex Monk)
[22:49:24] Ops-Access-Requests, operations, Wikidata: Requesting access to researchers & statistics-users for Addshore - https://phabricator.wikimedia.org/T116784#1760012 (Addshore) Going from the descriptions in data.yaml and @JanZerebecki's current groups we believed statistics-users was needed for the access...
[22:52:05] (CR) Alex Monk: ""Successfully added *.m.wikivoyage.beta.wmflabs.org entry for IP address 208.80.155.139."" [mediawiki-config] - https://gerrit.wikimedia.org/r/248639 (owner: Alex Monk)
[22:53:53] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/1094/" [puppet] - https://gerrit.wikimedia.org/r/249060 (owner: Dzahn)
[22:53:58] (PS5) Dzahn: logstash: fix double quoted strings & alignments [puppet] - https://gerrit.wikimedia.org/r/249060
[22:54:36] Ops-Access-Requests, operations, Wikidata: Requesting access to researchers & statistics-users for Addshore - https://phabricator.wikimedia.org/T116784#1760035 (Krenair) Please someone end this analytics group madness. @Addshore: Take a look at my groups, I have access to stat1003 and the research DB...
[22:55:25] operations, Labs, Tool-Labs, Icinga, Monitoring: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1760039 (Dzahn)
[22:59:39] operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760065 (Dzahn) let me steal this for a moment, to check out why they became status UNKNOWN. i said earlier i would but didn't get to it yet.
[22:59:45] operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760068 (Dzahn) a:Andrew>Dzahn
[23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151027T2300). Please do the needful.
[23:00:04] legoktm: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:13] hello
[23:00:27] I could deploy it I suppose
[23:00:55] (CR) Legoktm: [C: 2] Set $wgRCWatchCategoryMembership = true on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/249306 (owner: Legoktm)
[23:00:58] hi
[23:01:02] (Merged) jenkins-bot: Set $wgRCWatchCategoryMembership = true on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/249306 (owner: Legoktm)
[23:01:03] I have an item or two to add
[23:01:58] !log legoktm@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/249306, no-op (duration: 00m 18s)
[23:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:02:07] ok, well I'm done
[23:02:40] (PS4) Alex Monk: beta: Add enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/248639
[23:02:46] (CR) Alex Monk: [C: 2] beta: Add enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/248639 (owner: Alex Monk)
[23:02:53] (Merged) jenkins-bot: beta: Add enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/248639 (owner: Alex Monk)
[23:02:56] operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#874995 (Dzahn) @demon @bd808 what should we do with this ticket? reject? resolve once it uses etcd?
[23:02:58] (PS1) BryanDavis: Monolog: Use useMicrosecondTimestamps() on Loggers [mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550)
[23:04:53] (CR) BryanDavis: [C: -2] "Blocked by core and vendor changes. May be safe to roll with 1.27.0-wmf.5 (around 2015-11-05)."
[mediawiki-config] - https://gerrit.wikimedia.org/r/249313 (https://phabricator.wikimedia.org/T116550) (owner: BryanDavis)
[23:05:01] !log krenair@tin Synchronized dblists/all-labs.dblist: https://gerrit.wikimedia.org/r/#/c/248639/ (duration: 00m 18s)
[23:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:05:29] !log krenair@tin Synchronized wikiversions-labs.json: https://gerrit.wikimedia.org/r/#/c/248639/ (duration: 00m 17s)
[23:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:05:59] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/248639/ (duration: 00m 17s)
[23:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:06:30] operations, Labs, Tool-Labs, Icinga, Monitoring: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1760122 (Dzahn) Yuvipanda did it here: https://gerrit.wikimedia.org/r/#/c/249292/ exists now along with the SSL expiry check on the same virt...
[23:07:46] (CR) Dzahn: "bug in the commit message = 404" [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116925) (owner: Yuvipanda)
[23:07:46] PROBLEM - HHVM rendering on mw2050 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50412 bytes in 0.145 second response time
[23:07:57] PROBLEM - HHVM rendering on mw2128 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50412 bytes in 0.146 second response time
[23:08:08] (PS3) Dzahn: tools: Make home page check critical [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116825) (owner: Yuvipanda)
[23:08:15] * Krenair grumbles
[23:08:20] apparently beta has had an enwikivoyage before
[23:08:38] PROBLEM - Apache HTTP on mw2209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50412 bytes in 0.152 second response time
[23:08:38] there's a DB
[23:08:47] You make it sound like a disease
[23:08:48] PROBLEM - Apache HTTP on mw2050 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50412 bytes in 0.148 second response time
[23:08:49] :P
[23:08:52] (PS4) Dzahn: tools: Make home page check critical [puppet] - https://gerrit.wikimedia.org/r/249300 (https://phabricator.wikimedia.org/T116825) (owner: Yuvipanda)
[23:09:00] why are those codfw apaches failing?
[23:09:36] PROBLEM - HHVM rendering on mw2209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50412 bytes in 0.148 second response time
[23:09:36] PROBLEM - Apache HTTP on mw2128 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50412 bytes in 0.145 second response time
[23:10:16] let me check if these are the servers that were added to dsh groups earlier today
[23:10:27] * ori checks fluorine
[23:11:13] yes, they are
[23:11:13] Oct 27 23:10:56 mw2128 hhvm: {#012 "message": "Class undefined: MWWikiversions",#012 "file": "/srv/mediawiki/wmf-config/CommonSettings.php",#012
[23:11:17] https://gerrit.wikimedia.org/r/#/c/249234/2/modules/scap/files/dsh/group/mediawiki-installation
[23:11:23] they were commented in dsh groups
[23:11:29] and now they are in it again
[23:11:34] and it's the first time there was sync
[23:11:39] a sync is not adequate
[23:11:40] !log rebooted restbase1007 to rule out a funky hardware state causing elevated read latencies
[23:11:43] there needs to be a sync-common
[23:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:12:03] i'll run one
[23:12:26] 2050, 2128, 2187, 2209
[23:12:29] was the automatic creation of submodule patches disabled?
[23:12:34] thanks [23:12:40] tgr, several times [23:12:46] and then reenabled [23:12:49] MaxSem: ^ we needed a sync-common on those that were commented in dsh [23:12:53] last time I checked they were enabled [23:12:54] !log ran `sudo mdadm --readwrite md1` on restbase1007 to resolve `pending` state [23:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:59] and have been for a while [23:13:04] mutante: already running them [23:13:09] Krenair: just these https://gerrit.wikimedia.org/r/#/c/249234/2/modules/scap/files/dsh/group/mediawiki-installation [23:13:38] !log running sync-common on mw2050, mw2128, mw2187 and mw2209 (cf I324134438955c7) [23:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:49] So enwikivoyage in beta was actually created 22:51, 18 October 2012 [23:14:53] The DB, anyway [23:14:55] no content until today [23:15:44] other than the first revision [23:15:54] operations, Labs, Tool-Labs, Icinga, and 2 others: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1760169 (Dzahn) here's the part if it should send SMS to all of ops: https://gerrit.wikimedia.org/r/#/c/249300/ [23:15:59] I imagine Reedy is to blame :P [23:16:04] Krenair: doesn't seem to be working, I just merged https://gerrit.wikimedia.org/r/#/c/249314/ [23:16:09] operations, Labs, Tool-Labs, Icinga, and 2 others: Have a paging check in icinga for tools home page going down - https://phabricator.wikimedia.org/T116825#1760170 (Dzahn) a:yuvipanda [23:16:47] tgr, https://github.com/wikimedia/mediawiki/commit/5b10f2a3a799edc8702efd8d1b692fb3c9946193 [23:18:16] I thought it used to create a gerrit changeset for core [23:18:26] I am probably imagining things [23:19:46] mutante, ergh [23:20:12] Yep, you're imagining it tgr :P [23:20:33] anyone still working on tin?
I would add this to the SWAT [23:20:44] not me, go for it [23:20:59] what about the sync to those codfw hosts [23:21:06] it's still going on [23:21:07] we don't have recoveries yet [23:21:17] ok [23:22:31] wait, didn't bd808 say that sync-common is puppetized? [23:23:28] https://github.com/wikimedia/operations-puppet/blob/f9702bbf297e352e97b61ec30aac57c673beaf64/modules/mediawiki/manifests/scap.pp#L51-L62 [23:23:29] tgr: https://phabricator.wikimedia.org/rMW5b10f2a3a799edc8702efd8d1b692fb3c9946193 [23:23:38] oops i was scrolled up :P [23:23:42] (PS1) Alex Monk: labs restbase: Add en.wikivoyage.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/249319 [23:24:30] bd808, that's a one-time thing [23:24:33] MaxSem: Apparently it is only for the initial sync, I thought it was run more often [23:25:05] That makes the missing l10n earlier today even more confusing [23:25:34] That server was nearly in sync with tin (2 files changed) but commented out in the dsh group file [23:27:05] MaxSem: I think I confused myself about scap and trebuchet.
Trebuchet does catch up via puppet runs because it can ask the deploy server what tag it should have cloned [23:27:22] scap has no such magic (yet) [23:27:43] operations, Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1760190 (csteipp) [23:27:44] operations, Security-Reviews, Wikimedia-General-or-Unknown: Non-NDA users cannot access graphite.wikimedia.org - https://phabricator.wikimedia.org/T56713#1760191 (csteipp) [23:28:35] (PS1) Faidon Liambotis: smokeping: remove psw2-eqiad, decom'ed [puppet] - https://gerrit.wikimedia.org/r/249323 (https://phabricator.wikimedia.org/T115924) [23:28:54] (CR) Faidon Liambotis: [C: 2] smokeping: remove psw2-eqiad, decom'ed [puppet] - https://gerrit.wikimedia.org/r/249323 (https://phabricator.wikimedia.org/T115924) (owner: Faidon Liambotis) [23:28:59] (CR) Faidon Liambotis: [V: 2] smokeping: remove psw2-eqiad, decom'ed [puppet] - https://gerrit.wikimedia.org/r/249323 (https://phabricator.wikimedia.org/T115924) (owner: Faidon Liambotis) [23:34:23] greg-g, i need to deploy a new ver of graphoid service, should i wait until end of swat? [23:34:36] (actually it's in 30 min, so probably yes )) [23:35:54] andrewbogott: re: kvm cert. that needs to either be an NRPE check or the cert needs to be copied to neon [23:36:04] operations, ops-eqiad, netops, Patch-For-Review: Return psw2-eqiad to spares - https://phabricator.wikimedia.org/T115924#1760217 (Cmjohnson) psw2 has been disconnected, wiped and added to spares list.
[23:36:06] and separately for some reason that check_command is not created yet [23:36:20] operations, ops-eqiad, netops, Patch-For-Review: Return psw2-eqiad to spares - https://phabricator.wikimedia.org/T115924#1760218 (Cmjohnson) Open>Resolved [23:37:59] RECOVERY - Apache HTTP on mw2209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.434 second response time [23:38:57] RECOVERY - Apache HTTP on mw2128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.238 second response time [23:38:58] RECOVERY - HHVM rendering on mw2209 is OK: HTTP OK: HTTP/1.1 200 OK - 63799 bytes in 2.569 second response time [23:39:06] operations, Labs, Labs-Infrastructure, Monitoring, and 2 others: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1760227 (Dzahn) We either need to make this an NRPE task to be executed on the monitored hosts where the certs are, or we need to copy the cert to... [23:39:16] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 63799 bytes in 2.786 second response time [23:40:07] RECOVERY - Apache HTTP on mw2050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.112 second response time [23:40:48] RECOVERY - HHVM rendering on mw2050 is OK: HTTP OK: HTTP/1.1 200 OK - 63798 bytes in 0.277 second response time [23:40:55] (PS1) Faidon Liambotis: Remove psw2-eqiad, decom'ed [dns] - https://gerrit.wikimedia.org/r/249326 (https://phabricator.wikimedia.org/T115924) [23:40:59] mutante: ^ [23:41:03] (recoveries) [23:41:18] ori: yep, just saw. thank you [23:41:27] np! [23:41:59] (CR) Faidon Liambotis: [C: 2] Remove psw2-eqiad, decom'ed [dns] - https://gerrit.wikimedia.org/r/249326 (https://phabricator.wikimedia.org/T115924) (owner: Faidon Liambotis) [23:51:34] ori: still running that sync?
[23:51:43] nope [23:51:49] hence the recoveries [23:53:45] !log tgr@tin Started scap: Updating MediaViewer with r246112 [23:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:58] you're deploying from svn tgr? [23:56:25] http://i1.kym-cdn.com/photos/images/original/000/209/945/D6PfW.jpg [23:56:58] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures [23:57:35] (PS1) Dzahn: labs kvm ssl cert monitoring: fix it [puppet] - https://gerrit.wikimedia.org/r/249328 [23:58:21] Krenair: should have been c246112, not sure why I wrote that [23:58:27] :D [23:59:26] (PS2) Dzahn: labs kvm ssl cert monitoring: fix it [puppet] - https://gerrit.wikimedia.org/r/249328 (https://phabricator.wikimedia.org/T116332) [23:59:31] more importantly, that's a completely wrong ID [23:59:57] too many open windows