[00:01:10] yep! [00:02:07] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2250343 (10Dzahn) p:05Triage>03Normal [00:02:30] !log Previous deployment: [[Gerrit:279142]] Document FIXME statement in config (no-op) [00:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:02:57] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 02m 25s) [00:03:05] jdlrobson, ^ [00:03:14] !log Previous deployment: [[Gerrit:280865]]+[[Gerrit:285989]] Allow wmf-config/throttle.php to be lenient on ip/IP typo, clean rules (no-op) [00:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:03:33] !log Previous deployment: [[Gerrit:252627]] Revert "Increase abusefilter emergency disable threshold on MediaWiki.org" [00:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:03:50] !log Previous deployment: [[Gerrit:285927]] GoogleNewsSitemap configuration (T39608) [00:03:51] T39608: Enable Extension:GoogleNewsSitemap on elwikinews - https://phabricator.wikimedia.org/T39608 [00:03:53] Here you are ori. [00:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:04:01] thanks [00:04:16] !log Previous deployment: [[Gerrit:285553]] Enable lazy loaded references in beta (T129693) [00:04:17] T129693: Lazy load references in mobile beta channel - https://phabricator.wikimedia.org/T129693 [00:04:18] MaxSem: why not? [00:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:09] 06Operations, 10Monitoring: Check for an oversized exim4 queue indicating mail delivery failures - https://phabricator.wikimedia.org/T133110#2220489 (10Dzahn) We have this for the lists server. 
files/icinga/check_mailman_queue modules/role/manifests/lists/server.pp: nrpe_command => '/usr/bin/sudo -u l... [00:07:13] (03PS1) 10Catrope: Remove emailuser override for hewiki, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286105 (https://phabricator.wikimedia.org/T133927) [00:07:25] (03PS2) 10Aaron Schulz: Set "autoResync" on for local-multiwrite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285687 (https://phabricator.wikimedia.org/T128096) [00:08:04] (03PS2) 10Catrope: Remove emailuser override for hewiki, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286105 (https://phabricator.wikimedia.org/T133927) [00:08:42] Dereckson, we now have a ton of "Undefined index: 1 in /srv/mediawiki/php-1.27.0-wmf.21/extensions/CirrusSearch/includes/Hooks.php on line 189" [00:09:13] looks good MaxSem [00:09:57] Bawolff > could it be el.wikinews namespace change? [00:10:16] oh he's not here [00:11:25] mhm, appears one server was off [00:11:36] wmf.21 ? 
[00:11:37] (639 times)
[00:12:15] could be a maintenance script they are running
[00:12:31] nothing in the SWAT seems to be Cirrus related
[00:12:56] mw1232
[00:14:27] would you have a stacktrace?
[00:14:46] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[00:15:37] Dereckson, nvm - kafka pooped itself
[00:15:56] that was a message from April 15 finally arriving
[00:19:07] (CR) Aaron Schulz: [C: 2] Set "autoResync" on for local-multiwrite [mediawiki-config] - https://gerrit.wikimedia.org/r/285687 (https://phabricator.wikimedia.org/T128096) (owner: Aaron Schulz)
[00:19:31] (Merged) jenkins-bot: Set "autoResync" on for local-multiwrite [mediawiki-config] - https://gerrit.wikimedia.org/r/285687 (https://phabricator.wikimedia.org/T128096) (owner: Aaron Schulz)
[00:22:22] ACKNOWLEDGEMENT - mysqld processes on holmium is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld cpettet it looks like apparmor is killing mysql here, I believe mysql is from T128737 and is not yet in use so silencing to look at tomorrow
[00:22:38] !log aaron@tin Synchronized wmf-config/filebackend-production.php: Set "autoResync" on for local-multiwrite (duration: 02m 29s)
[00:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:25:14] MaxSem: you got a kafka trouble?
[00:27:37] !log RT - remove libapache2-mod-php5, restart Apache, Perl apps dont need PHP
[00:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:03] (PS1) Dzahn: RT: don't include PHP, this is Perl [puppet] - https://gerrit.wikimedia.org/r/286107
[00:30:55] (CR) Dzahn: [C: 2] "17:32 < mutante> !log RT - remove libapache2-mod-php5, restart Apache, Perl apps dont need PHP" [puppet] - https://gerrit.wikimedia.org/r/286107 (owner: Dzahn)
[00:31:11] MaxSem: done with swat?
[00:31:27] AaronSchulz: ori done deploying?
[00:31:31] aude: i think so, he's not at his desk atm [00:31:38] ok [00:31:46] i'd like to take care of https://gerrit.wikimedia.org/r/#/c/286087/ [00:31:47] * AaronSchulz isn't doing anything [00:31:49] ok [00:32:04] PROBLEM - puppet last run on ununpentium is CRITICAL: CRITICAL: Puppet has 1 failures [00:33:33] (03PS2) 10Dzahn: RT: don't include PHP, this is Perl [puppet] - 10https://gerrit.wikimedia.org/r/286107 (https://phabricator.wikimedia.org/T119112) [00:33:38] (03PS1) 10Dzahn: RT: include Apache mod rewrite [puppet] - 10https://gerrit.wikimedia.org/r/286108 (https://phabricator.wikimedia.org/T119112) [00:34:13] * aude assumes ori is not there also [00:35:14] (03PS2) 10Dzahn: RT: include Apache mod rewrite [puppet] - 10https://gerrit.wikimedia.org/r/286108 (https://phabricator.wikimedia.org/T119112) [00:35:29] (03CR) 10Dzahn: [C: 032] RT: include Apache mod rewrite [puppet] - 10https://gerrit.wikimedia.org/r/286108 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [00:56:32] 06Operations: missing /etc/ssl/dhparam.pem on jessie Apache using ssl_ciphersuite - https://phabricator.wikimedia.org/T133966#2250458 (10Dzahn) [01:04:04] * aude having some trouble with jenkins but wants to resolve and still deploy [01:06:55] (03CR) 10Dzahn: "hmm. that's strange. actual role classes should be applied to instances, that's the point of them. but yea, this doesnt keeep me from merg" [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [01:07:34] (03PS6) 10Dzahn: phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 [01:07:43] (03CR) 10Dzahn: [C: 032] phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [01:10:08] 06Operations, 06Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#957178 (10lfschenone) Hi! I'd like to request a rename too. 
My global username has recently changed from "Luis Felipe Schenone" to "Felipe Schenone" so... [01:13:48] (03CR) 10Dzahn: "i'd like to say "no-op on phragile-pro" and it is afaict, but puppet is already pre-broken here too :/" [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [01:15:04] (03CR) 10Dzahn: "so the role class is not used and what is used is broken. can improve" [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [01:15:08] 06Operations, 06Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#957178 (10Peachey88) >>! In T85913#2250474, @lfschenone wrote: > Hi! I'd like to request a rename too. My global username has recently changed from "Lui... [01:17:15] 07Puppet, 10Phragile, 06TCB-Team, 07Composer: puppet fail due to composer install on phragile instance - https://phabricator.wikimedia.org/T133967#2250478 (10Dzahn) [01:20:45] 06Operations, 06Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T133968#2250492 (10lfschenone) [01:22:22] 06Operations, 06Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T133968#2250511 (10lfschenone) [01:22:24] 06Operations, 06Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#957178 (10lfschenone) [01:22:27] 06Operations, 06Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T133968#2250492 (10lfschenone) [01:23:08] 06Operations, 06Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T133968#2250492 (10lfschenone) [01:23:52] 06Operations, 06Labs, 10wikitech.wikimedia.org: Rename specific 
account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T133968#2250515 (10lfschenone) [01:25:34] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2250003 (10Dzahn) @DStrine While we are awaiting manager approval, you could already create a SSH keypair and upload the public part here on the ticket. That would a... [01:34:45] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/2629/" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [01:35:23] (03PS2) 10Dzahn: install_server: split out reprepro role [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [01:35:45] (03CR) 10jenkins-bot: [V: 04-1] install_server: split out reprepro role [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [01:51:58] 06Operations, 06Labs, 06Release-Engineering-Team, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T133968#2250550 (10Dzahn) [01:55:06] 06Operations: export logs to logstash or create apertium-admins group on sca1001/sca1002 - https://phabricator.wikimedia.org/T89222#2250551 (10Dzahn) hmm. .then duplicate of T97297, suggesting to close [01:57:11] (03PS1) 10BryanDavis: horizon: Enable password autocomplete on login form [puppet] - 10https://gerrit.wikimedia.org/r/286112 [02:13:32] 06Operations, 10Internet-Archive, 10Wikimedia-Planet, 07Upstream: wordpress.com seems to have blocked us from fetching feeds - https://phabricator.wikimedia.org/T133818#2250565 (10Dzahn) Ok, so i tried contacting Wordpress myself but it's nearly impossible. To post a question even on the comunity forums (w... 
[02:19:59] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 09m 20s) [02:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:21:16] 06Operations, 10Internet-Archive, 10Wikimedia-Planet, 07Upstream: wordpress.com seems to have blocked us from fetching feeds - https://phabricator.wikimedia.org/T133818#2250569 (10Dzahn) Ah, going to Automattic directly rather than Wordpress.com seems to be a way I pasted my message above into the form ht... [02:22:35] !log krenair@tin Synchronized php-1.27.0-wmf.22/extensions/EventBus: https://gerrit.wikimedia.org/r/286115 (duration: 02m 27s) [02:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:51] 06Operations, 10Internet-Archive, 10Wikimedia-Planet, 07Upstream: wordpress.com seems to have blocked us from fetching feeds - https://phabricator.wikimedia.org/T133818#2250570 (10Dzahn) and if that doesn't work we can show up at the Lounge some day :p https://automattic.com/lounge/ [02:26:31] mutante: could probably tweet automatic as well from one of our twitter accounts [02:29:37] !log last deployment was slow because of snapshot1007 being offline, icinga shows it's been like that for the last 7 hours [02:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:44:33] (03PS1) 10BBlack: LE: require sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/286118 [02:45:16] (03CR) 10BBlack: [C: 032 V: 032] LE: require sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/286118 (owner: 10BBlack) [02:55:10] PROBLEM - Disk space on elastic1012 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80099 MB (15% inode=99%) [03:05:20] (03PS3) 10BBlack: note future anycast networks [dns] - 10https://gerrit.wikimedia.org/r/286066 (https://phabricator.wikimedia.org/T98006) [03:05:48] (03CR) 10BBlack: [C: 032] note future anycast networks [dns] - 10https://gerrit.wikimedia.org/r/286066 
(https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [03:20:50] PROBLEM - Disk space on elastic1012 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80769 MB (15% inode=99%) [03:33:14] RECOVERY - Disk space on elastic1012 is OK: DISK OK [03:52:53] PROBLEM - puppet last run on db2004 is CRITICAL: CRITICAL: puppet fail [04:19:54] RECOVERY - puppet last run on db2004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [04:31:24] (03PS1) 10Muehlenhoff: Add list-jobs command to display the job queue [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/286121 [04:39:39] (03CR) 10Andrew Bogott: [C: 031] "This seems fine to me but I want to play with it in labtest before I merge." [puppet] - 10https://gerrit.wikimedia.org/r/286112 (owner: 10BryanDavis) [04:58:33] !log restarting elasticsearch server elastic1013.eqiad.wmnet (T110236) [04:58:34] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [04:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:13:52] 06Operations: missing /etc/ssl/dhparam.pem on jessie Apache using ssl_ciphersuite - https://phabricator.wikimedia.org/T133966#2250627 (10Dzahn) @Bblack already fixed it with https://gerrit.wikimedia.org/r/#/c/286118/ it looks [05:14:36] 06Operations: missing /etc/ssl/dhparam.pem on jessie Apache using ssl_ciphersuite - https://phabricator.wikimedia.org/T133966#2250628 (10Dzahn) 05Open>03Resolved a:03Dzahn yes, he did. thank you! ``` root@ununpentium:~# cat /etc/ssl/dhparam.pem -----BEGIN DH PARAMETERS----- ... 
``` [05:16:21] (03CR) 10Dzahn: "resolved https://phabricator.wikimedia.org/T133966" [puppet] - 10https://gerrit.wikimedia.org/r/286118 (owner: 10BBlack) [05:20:47] 06Operations, 13Patch-For-Review: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2250638 (10Dzahn) [05:20:49] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2250637 (10Dzahn) 05stalled>03Open [05:21:21] (03PS1) 10Dzahn: RT: include Apache mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/286127 (https://phabricator.wikimedia.org/T119112) [05:21:44] (03PS2) 10Dzahn: RT: include Apache mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/286127 (https://phabricator.wikimedia.org/T119112) [05:22:11] (03CR) 10Dzahn: [C: 032] RT: include Apache mod_headers [puppet] - 10https://gerrit.wikimedia.org/r/286127 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [05:23:52] 06Operations: Reprepro should bail if it can't read and sign using the root keys - https://phabricator.wikimedia.org/T116951#2250641 (10MoritzMuehlenhoff) p:05Triage>03Low [05:24:39] (03CR) 10Dzahn: "Apache/Service[apache2]/ensure: ensure changed 'stopped' to 'running'" [puppet] - 10https://gerrit.wikimedia.org/r/286127 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [05:24:39] RECOVERY - HTTPS on ununpentium is OK: SSL OK - Certificate rt.wikimedia.org valid until 2016-07-27 02:03:35 +0000 (expires in 88 days) [05:25:40] (03CR) 10Dzahn: "< icinga-wm> RECOVERY - HTTPS on ununpentium is OK: SSL OK - Certificate rt.wikimedia.org valid until 2016-07-27 02:03:35 +0000 (expires i" [puppet] - 10https://gerrit.wikimedia.org/r/286127 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [05:33:20] ACKNOWLEDGEMENT - WDQS HTTP on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 416 bytes in 0.040 second response time daniel_zahn scheduled downtime was set [05:33:21] 
ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 416 bytes in 0.012 second response time daniel_zahn scheduled downtime was set [05:34:59] !log snapshot1007 - not reachable, duration 10h [05:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:38:47] 06Operations: Remove unused Samba packages - https://phabricator.wikimedia.org/T132915#2250644 (10MoritzMuehlenhoff) 05Open>03Resolved Samba has been removed (with the exception of the maps cluster, where some OSM libraries depend on libsmbclient) [05:39:31] !log snapshot1007 - was powered down, powering it on. (..connect to mgmt.. "damn it's a HP") [05:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:41:19] RECOVERY - Host snapshot1007 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [05:41:21] !log re: "02:29 Krenair: last deployment was slow because of snapshot1007 being offline" it's back, i don't know why, it was powered down and i just tried switching it on. that helped. the command is literally "power on" on HP [05:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:42:50] !log restarting elasticsearch server elastic1014.eqiad.wmnet (T110236) [05:42:51] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [05:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:52:18] 06Operations: Remove unused Samba packages - https://phabricator.wikimedia.org/T132915#2250647 (10Dzahn) Cool! Eh.. did you have to change nagios-plugins-standard for that? [05:58:19] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2250651 (10Dzahn) Apache is running now. 
< icinga-wm> RECOVERY - HTTPS on ununpentium is OK: SSL OK - Certificate rt.wikimedia.org valid until 2016-07-27 02:03:35 +0000 (expires in 88 days) RT is installed... [06:00:54] <_joe_> ununpentium [06:00:59] <_joe_> sigh [06:11:00] _joe_: "Ununpentium has no practical uses yet. It's so unstable that it doesn't stay around long enough to make anything out of it." :p 'night [06:11:36] http://www.newyorker.com/tech/elements/ununpentium-the-newest-element [06:20:56] we should have used that element for etherpad :-) [06:28:33] (03PS1) 10Jcrespo: Repool db1038, increase weight of new hardware slaves db107[4-8] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286129 (https://phabricator.wikimedia.org/T125028) [06:29:46] (03PS2) 10Jcrespo: Repool db1038, increase weight of new hardware slaves db107[4-8] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286129 (https://phabricator.wikimedia.org/T125028) [06:29:51] <_joe_> I'll bbiab [06:30:20] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:30] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:20] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:20] (03CR) 10Jcrespo: [C: 032] Repool db1038, increase weight of new hardware slaves db107[4-8] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286129 (https://phabricator.wikimedia.org/T125028) (owner: 10Jcrespo) [06:31:21] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:40] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:30] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:49] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1038, increase weight of new hardware slaves db107[4-8] (duration: 00m 33s) [06:33:22] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:33:30] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:01] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:37:10] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:44:49] (03PS1) 10Jcrespo: Reduce normal traffic on s2 API servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286130 [06:45:23] (03CR) 10Jcrespo: [C: 032] Reduce normal traffic on s2 API servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286130 (owner: 10Jcrespo) [06:46:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Reduce normal traffic on s2 API servers (duration: 00m 27s) [06:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:55:49] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:55:50] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:55:50] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:56:14] RECOVERY - mysqld processes on holmium is OK: PROCS OK: 1 process with command name mysqld [06:56:49] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:58:09] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:50] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:22:56] PROBLEM - Disk space on restbase1014 is CRITICAL: DISK CRITICAL - free space: /srv 187578 MB (3% inode=99%) [07:33:35] (03PS3) 10Elukey: Enable kafka200[12] to host Kafka and EventBus. 
[puppet] - 10https://gerrit.wikimedia.org/r/285958 (https://phabricator.wikimedia.org/T121558) [07:36:20] (03CR) 10Elukey: [C: 032] Enable kafka200[12] to host Kafka and EventBus. [puppet] - 10https://gerrit.wikimedia.org/r/285958 (https://phabricator.wikimedia.org/T121558) (owner: 10Elukey) [07:36:28] !log stop cleanups on restbase1014-b [07:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:37:05] RECOVERY - Disk space on restbase1014 is OK: DISK OK [07:45:26] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures [07:46:22] checking ---^, it seems to be related to the absence of Event bus, nothing big (I am provisioning those hosts) [07:47:26] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:49:15] (03PS2) 10Muehlenhoff: Enable base::firewall on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/285904 [07:52:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/285904 (owner: 10Muehlenhoff) [07:52:37] !log restarting elasticsearch server elastic1015.eqiad.wmnet (T110236) [07:52:37] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [07:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:53:25] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures [07:54:14] !log enabled base::firewall on stat1002 [07:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:57:12] !log puppet disabled on new kafka codfw instances due to errors while starting Event Bus (hosts not in service) [07:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:57:50] (03CR) 10Giuseppe Lavagetto: [C: 031] "I am unsure if this would break existing downloads, but apart from that 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/286068 (https://phabricator.wikimedia.org/T133864) (owner: 10Cscott) [07:58:39] 06Operations, 10OCG-General, 13Patch-For-Review, 05codfw-rollout: Use FQDNs instead of hostnames in the download urls sent to Mediawiki - https://phabricator.wikimedia.org/T133864#2250701 (10Joe) @cscott great, the patch would work as far as puppet is concerned. Would this harm currently cached objects, th... [08:06:26] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [08:20:49] !log restarting elasticsearch server elastic1016.eqiad.wmnet (T110236) [08:20:50] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [08:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:40:22] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2243191 (10fgiunchedi) >>! In T133785#2249043, @Ottomata wrote: > Ok! We discussed partitioning today. We'd like the following: > > - / a small (30G?) RAID 1 partition on the first 2 dr... [08:41:49] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2250727 (10JAllemandou) Interesting @fgiunchedi. But what in case of failure, two instances down? 
[08:46:17] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:46:50] ---^ this is me again
[08:52:53] (PS1) Aklapper: List Phabricator projects with only a single workboard column [puppet] - https://gerrit.wikimedia.org/r/286133
[08:53:19] (PS2) Aklapper: List Phabricator projects with only a single workboard column [puppet] - https://gerrit.wikimedia.org/r/286133
[08:58:15] Operations, ops-eqiad, Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2250788 (fgiunchedi) @JAllemandou failure of which component? the other different thing for cassandra/restbase in production is that it maximizes available disk space, so ssds there for...
[09:01:57] !log changing live configuration of db1049 thread_pool_stall_limit to 10 to test impact on connection timeout
[09:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:04:37] ACKNOWLEDGEMENT - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures Elukey New Service, some errors deploying EventBus
[09:19:04] (PS1) Elukey: Add eventbus_codfw to the monitoring hiera variables.
[puppet] - 10https://gerrit.wikimedia.org/r/286134 (https://phabricator.wikimedia.org/T121558) [09:27:43] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/2632/neon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/286134 (https://phabricator.wikimedia.org/T121558) (owner: 10Elukey) [09:35:30] (03PS2) 10Giuseppe Lavagetto: mediawiki::web: drop HHVM define, explicitly block php [puppet] - 10https://gerrit.wikimedia.org/r/285368 (https://phabricator.wikimedia.org/T126310) [09:42:58] !log restarting elasticsearch server elastic1016.eqiad.wmnet (T110236) [09:42:59] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [09:43:03] !log restarting elasticsearch server elastic1017.eqiad.wmnet (T110236) [09:43:04] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [09:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:44:42] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [09:45:43] (03CR) 10Muehlenhoff: "We could also "absent" the PHP packages no longer needed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/285368 (https://phabricator.wikimedia.org/T126310) (owner: 10Giuseppe Lavagetto) [09:47:32] PROBLEM - Check that eventlogging-service-eventbus is running on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus [09:48:44] this is me again after fixing icinga [09:49:13] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.169, port=8085): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [09:49:33] * elukey turns off alarm in icinga [09:50:23] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.139, port=8085): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [09:51:08] 06Operations: Remove unused Samba packages - https://phabricator.wikimedia.org/T132915#2250887 (10MoritzMuehlenhoff) > Cool! Eh.. did you have to change nagios-plugins-standard for that? No, all these packages were installed at the time when apt was still installing recommended packages. That's no longer an... 
[09:52:22] (03PS2) 10Muehlenhoff: Add salt grain for RT and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285913 [09:53:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grain for RT and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285913 (owner: 10Muehlenhoff) [09:53:35] ACKNOWLEDGEMENT - Check that eventlogging-service-eventbus is running on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus Elukey New service, not taking live traffic. Still working on Event Bus. [09:53:35] ACKNOWLEDGEMENT - eventlogging-service-eventbus endpoints health on kafka2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.0.139, port=8085): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Elukey New service, not taking live traffic. Still working on Event Bus. [09:53:35] ACKNOWLEDGEMENT - Check that eventlogging-service-eventbus is running on kafka2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus Elukey New service, not taking live traffic. Still working on Event Bus. [09:53:35] ACKNOWLEDGEMENT - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.169, port=8085): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Elukey New service, not taking live traffic. Still working on Event Bus. 
[10:01:46] Operations: add contract end dates to the ops maint & contract gcal - https://phabricator.wikimedia.org/T84585#2250903 (fgiunchedi)
[10:03:26] Operations, Deployment-Systems, Monitoring: [ops] Monitor that LVS config and mw_install are in sync - https://phabricator.wikimedia.org/T25662#2250906 (fgiunchedi) Open>Invalid we have icinga checks for hosts missing in dsh groups for scap, plus pybal talking to etcd, conftool, and all the rest
[10:04:18] Operations: deal with puppet's poor habit of spewing file contents to syslog - https://phabricator.wikimedia.org/T81886#2250908 (fgiunchedi)
[10:04:25] <_joe_> wow ticket archeology :P
[10:04:38] Operations: restrict access to puppet logs - https://phabricator.wikimedia.org/T84242#924814 (fgiunchedi)
[10:04:58] hold my hardhat, I'm going in!
[10:05:42] <_joe_> godog: http://4.bp.blogspot.com/-0KGktQ7BCp8/TvP9DdpI7vI/AAAAAAAAEs4/OXRlyhj2bGw/s1600/IJ_rockroll-cropped.gif
[10:05:59] heheh something like that
[10:09:19] _joe_: not related but any idea why https://gerrit.wikimedia.org/r/#/c/285961/2 would fail with https://puppet-compiler.wmflabs.org/2633/bromine.eqiad.wmnet/change.bromine.eqiad.wmnet.err ? I'm trying to get rid of secret fileserver usage
[10:09:52] <_joe_> godog: let me take a look
[10:10:44] <_joe_> uh that looks like completely unrelated to your change?
[10:11:36] could be, it is under the "Hosts that fail to compile when the change is applied
[10:11:52] though
[10:11:56] <_joe_> yeah so...
[10:12:09] <_joe_> no immediate idea sorry [10:12:41] np, seems odd [10:12:47] <_joe_> oh I see [10:12:51] <_joe_> no it's not [10:13:05] <_joe_> you should use secret() explicitly [10:13:29] <_joe_> and maybe not if $gpg_pubring != undef [10:13:40] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2250942 (10jcrespo) A bit offtopic, but Re: > right now requires RO on MediaWiki, the procedure can surely be improved I think something along the lines of "master-master" (galera?) with a... [10:15:28] _joe_: not sure what you mean explicitly, as a default? [10:15:33] <_joe_> let me take a look [10:16:41] thanks! [10:18:25] <_joe_> godog: so in releases::reprepro which is included in bromine [10:18:55] <_joe_> sorry, releases::reprepro::upload [10:20:27] <_joe_> err, I was right the first time [10:21:02] <_joe_> so, you are passing to the class reprepro a binary sequence [10:21:15] <_joe_> when you do if $gpg_pubring != undef [10:21:17] <_joe_> later [10:21:35] <_joe_> it tries to evaluate the equality in string context, I think [10:21:46] <_joe_> so it tries to decode an utf8 string [10:21:49] <_joe_> and that fails [10:22:48] <_joe_> does that sound correct? [10:23:19] yeah that'd make sense I think, how to handle this case though? [10:25:06] <_joe_> godog: I would pass the name of the secret to the reprepro class [10:25:15] <_joe_> instead than the secret itself [10:26:21] and ditto for the public keyring, ok I'll try that thanks! 
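The failure mode _joe_ describes above (a binary keyring being evaluated in string context, so something tries to decode it as UTF-8 and fails) can be sketched in a few lines of Python. This is only an illustration of the general mechanism, not the Puppet or puppet-compiler code path; the byte values, the helper name `string_context_fails` and the secret name `releases/pubring.gpg` are assumptions made up for the example.

```python
# Minimal sketch, assuming the compile error comes from decoding raw
# keyring bytes as UTF-8. All names and values here are hypothetical.

def string_context_fails(value: bytes) -> bool:
    """Return True if treating `value` as a UTF-8 string raises an error."""
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Stand-in for binary keyring content: 0x99 is not a valid UTF-8 start byte.
binary_keyring = bytes([0x99, 0x01, 0x0D, 0x04])

# Passing the *name* of the secret instead keeps the value plain text,
# which is the workaround suggested in the conversation.
secret_name = "releases/pubring.gpg"  # hypothetical name

print(string_context_fails(binary_keyring))               # binary payload
print(string_context_fails(secret_name.encode("utf-8")))  # plain-text name
```

Handing around a short text identifier and resolving it to the binary content only at the point of use avoids any comparison or serialization step ever seeing the raw bytes.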
[10:31:02] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 17 failures [10:37:18] (03PS3) 10Filippo Giunchedi: releases: use secret() for gpg keyring [puppet] - 10https://gerrit.wikimedia.org/r/285961 [10:38:07] (03PS1) 10Jcrespo: Repool db2047 and db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286138 [10:39:58] RECOVERY - Check that eventlogging-service-eventbus is running on kafka2001 is OK: PROCS OK: 1 process with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus [10:40:07] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [10:40:43] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: switch globally to use the worker mpm instead of prefork [puppet] - 10https://gerrit.wikimedia.org/r/286139 [10:43:17] 06Operations, 10puppet-compiler: puppet compiler error on catalog with non-ascii output - https://phabricator.wikimedia.org/T133979#2251021 (10fgiunchedi) [10:43:27] 06Operations, 10puppet-compiler: puppet compiler error on catalog with non-ascii output - https://phabricator.wikimedia.org/T133979#2251033 (10fgiunchedi) p:05Triage>03Low [10:43:35] _joe_: it works, thanks! [10:46:58] 06Operations, 10DBA: setup/install/deploy db2033 - https://phabricator.wikimedia.org/T122998#2251040 (10jcrespo) Let's use it as the x1 server and decom db2008 and db2009. 
[10:47:15] 06Operations, 10DBA: setup/install/deploy db2033 - https://phabricator.wikimedia.org/T122998#2251042 (10jcrespo) [11:04:19] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/2635/" [puppet] - 10https://gerrit.wikimedia.org/r/286139 (owner: 10Giuseppe Lavagetto) [11:04:59] 06Operations, 10ops-codfw, 06Analytics-Kanban, 06DC-Ops, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2251078 (10mobrovac) [11:10:28] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [11:11:11] <_joe_> mobrovac: ^^ sounds familiar? [11:11:28] * mobrovac runs away as far as possible [11:11:44] oh but wait [11:11:48] that's not the usual erroir [11:11:53] <_joe_> nope it's not [11:11:58] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [11:11:58] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [11:11:58] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [11:12:01] will look into it [11:13:57] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [11:17:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [11:17:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [11:17:58] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [11:18:07] * mobrovac 
sighs [11:18:33] we have to completely rethink these citoid checks [11:18:49] pbs.org is now having problems so the test fails [11:19:58] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) [11:22:19] ACKNOWLEDGEMENT - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) Marko Obrovac pbs.org is failing to load [11:22:19] ACKNOWLEDGEMENT - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) Marko Obrovac pbs.org is failing to load [11:22:19] ACKNOWLEDGEMENT - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) Marko Obrovac pbs.org is failing to load [11:22:19] ACKNOWLEDGEMENT - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) is CRITICAL: Test open graph via native scraper returned the unexpected status 520 (expecting: 200) Marko Obrovac pbs.org is failing to load [11:23:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:23:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:24:33] (03PS1) 10Muehlenhoff: Assign salt grains for webperf role and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/286140 [11:28:16] (03CR) 10Mobrovac: [C: 04-1] "service::node provides most of the config, we only need the spec part here." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/285678 (https://phabricator.wikimedia.org/T133221) (owner: 10Ppchelko) [11:28:37] (03CR) 10Mobrovac: "For reference, see citoid's or mathoid's config files." [puppet] - 10https://gerrit.wikimedia.org/r/285678 (https://phabricator.wikimedia.org/T133221) (owner: 10Ppchelko) [11:28:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for webperf role and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/286140 (owner: 10Muehlenhoff) [11:30:08] PROBLEM - HHVM rendering on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:30:38] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:30:59] PROBLEM - Check size of conntrack table on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:08] PROBLEM - configured eth on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:09] PROBLEM - dhclient process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:18] PROBLEM - DPKG on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:19] are there any known issues with mailserver? [11:31:33] or maybe phabricator [11:31:37] PROBLEM - nutcracker port on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:37] PROBLEM - Disk space on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:38] PROBLEM - RAID on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:31:57] PROBLEM - nutcracker process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:32:04] i haven't received any phabricator notice [11:32:10] for couple days [11:33:08] PROBLEM - HHVM processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:33:49] PROBLEM - SSH on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:34:47] PROBLEM - salt-minion processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:36:17] Danny_B: check your spam box maybe? [11:36:21] I have been [11:36:28] Danny_B: what email provider do you use? [11:37:52] p858snake: checking spam is of course the very first what i always do before reporting possible email issues ;-) [11:38:07] checking mw1119 [11:39:35] !log soft reboot for mw1119 (not responsive to ssh, root login timed out on the console) [11:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:41:27] RECOVERY - DPKG on mw1119 is OK: All packages OK [11:41:28] steady ramp up during the past days https://grafana.wikimedia.org/dashboard/db/server-board?panelId=14&fullscreen [11:41:38] RECOVERY - SSH on mw1119 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [11:41:48] RECOVERY - nutcracker port on mw1119 is OK: TCP OK - 0.000 second response time on port 11212 [11:41:58] RECOVERY - Disk space on mw1119 is OK: DISK OK [11:42:07] RECOVERY - RAID on mw1119 is OK: OK: no RAID installed [11:42:08] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.531 second response time [11:42:19] RECOVERY - Check size of conntrack table on mw1119 is OK: OK: nf_conntrack is 0 % full [11:42:28] RECOVERY - salt-minion processes on mw1119 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:42:29] RECOVERY - nutcracker process on mw1119 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:42:38] RECOVERY - HHVM processes on mw1119 is OK: PROCS OK: 6 processes with command name hhvm [11:42:48] RECOVERY - configured eth on mw1119 is OK: OK - interfaces up [11:42:58] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 71415 bytes in 6.531 second response time [11:42:59] RECOVERY - 
dhclient process on mw1119 is OK: PROCS OK: 0 processes with command name dhclient [11:46:08] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:49:11] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2251099 (10MoritzMuehlenhoff) I'll build a backport of 2.4.41. [11:49:37] <_joe_> elukey: the grafana links like that don't really work [11:50:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add list-jobs command to display the job queue [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/286121 (owner: 10Muehlenhoff) [11:51:16] 06Operations, 10ops-codfw, 06Analytics-Kanban, 06DC-Ops, and 5 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2251101 (10elukey) - Icinga configuration updated - Added kafka200[12] to eventbus scap config (thanks to Marko) - Run puppet on both nodes, no errors Next steps:... [11:52:42] (03PS1) 10Giuseppe Lavagetto: puppet: install msgpack and allow switching it on/off [puppet] - 10https://gerrit.wikimedia.org/r/286141 [11:53:07] <_joe_> hashar, mobrovac ^^ [11:53:20] <_joe_> I am cherry-picking that on the beta puppetmaster [11:53:23] looking [11:53:24] kk [11:53:40] (03PS2) 10Giuseppe Lavagetto: puppet: install msgpack and allow switching it on/off [puppet] - 10https://gerrit.wikimedia.org/r/286141 [11:54:41] _joe_ yeah sorry I noticed right after pressing enter, but it is easy to switch to mw1119 anyway [11:55:18] (03CR) 10Mobrovac: puppet: install msgpack and allow switching it on/off (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/286141 (owner: 10Giuseppe Lavagetto) [12:00:41] (03CR) 10Giuseppe Lavagetto: puppet: install msgpack and allow switching it on/off (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/286141 (owner: 10Giuseppe Lavagetto) [12:04:37] (03PS1) 10Yuvipanda: Don't try to use /usr/local/bin/webservice [software/tools-manifest] - 
10https://gerrit.wikimedia.org/r/286143 [12:04:41] (03PS3) 10Giuseppe Lavagetto: puppet: install msgpack and allow switching it on/off [puppet] - 10https://gerrit.wikimedia.org/r/286141 [12:06:52] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:07:13] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [12:07:23] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:13:24] (03PS1) 10Muehlenhoff: Enable base::firewall on labtestcontrol2001 [puppet] - 10https://gerrit.wikimedia.org/r/286145 [12:19:49] 06Operations: identify physical disks for cisco machines - https://phabricator.wikimedia.org/T84981#2251145 (10fgiunchedi) 05Open>03declined we're going to decommission ciscos, see {T128821} [12:21:40] 06Operations: decrease negative cache TTL for lookups from MTA to google (WMF google apps) - https://phabricator.wikimedia.org/T84600#2251149 (10fgiunchedi) [12:24:14] (03PS1) 10Dereckson: Apache: redirect pk.wikimedia.org to wikimediapakistan.org [puppet] - 10https://gerrit.wikimedia.org/r/286147 (https://phabricator.wikimedia.org/T56780) [12:28:26] 06Operations: decrease negative cache TTL for lookups from MTA to google (WMF google apps) - https://phabricator.wikimedia.org/T84600#2251156 (10fgiunchedi) @bbogaert resurrecting this old task, have you come across this problem recently? namely newly created accounts being able to receive mail two hours after c... [12:28:33] 06Operations: "pxe boot once" option for HP servers - https://phabricator.wikimedia.org/T89443#2251158 (10fgiunchedi) 05Open>03Invalid indeed, IIRC also the same works via `wmf-reimage` (i.e. 
using ipmi) [12:29:47] 06Operations, 05Security: Document Debian/Ubuntu security update procedure & command - https://phabricator.wikimedia.org/T88469#2251162 (10fgiunchedi) 05Open>03declined in the meantime it is possible to use `debdeploy` from @MoritzMuehlenhoff to perform (security) upgrades across the fleet [12:30:15] !log restarting elasticsearch server elastic1018.eqiad.wmnet (T110236) [12:30:16] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [12:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:31:20] (03PS1) 10Muehlenhoff: Enable base::firewall on potassium [puppet] - 10https://gerrit.wikimedia.org/r/286148 [12:34:18] (03PS5) 10Muehlenhoff: jsbench: Add ferm rules for xvfb [puppet] - 10https://gerrit.wikimedia.org/r/282318 [12:35:50] 06Operations, 10scap: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#2251172 (10fgiunchedi) looping in #scap since it also belongs there [12:36:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] jsbench: Add ferm rules for xvfb [puppet] - 10https://gerrit.wikimedia.org/r/282318 (owner: 10Muehlenhoff) [12:40:30] _joe_: neat :) [12:47:20] 06Operations, 07Puppet, 07HHVM: Tighten permissions on HHVM bytecode cache - https://phabricator.wikimedia.org/T85990#2251185 (10fgiunchedi) this is still the case ``` mw1015:~$ ls -la /var/cache/hhvm/ total 263776 drwxr-xr-x 2 www-data www-data 4096 Apr 29 12:05 . drwxr-xr-x 15 root root... [12:50:21] 06Operations, 10Traffic, 07HTTPS: Getting ssl_error_inappropriate_fallback_alert very rarely - https://phabricator.wikimedia.org/T108579#1523931 (10fgiunchedi) @DaBPunkt are you still getting the same sporadic error? 
[12:54:52] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 638 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5174695 keys - replication_delay is 638 [13:01:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5141956 keys - replication_delay is 0 [13:02:47] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2251236 (10elukey) [13:02:51] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#798408 (10elukey) [13:03:01] 06Operations, 10ops-codfw, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2251238 (10elukey) [13:09:35] !log restarting elasticsearch server elastic1019.eqiad.wmnet (T110236) [13:09:36] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [13:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:38] (03PS1) 10Mobrovac: Changeprop: Set hyperswitch as the start-up module [puppet] - 10https://gerrit.wikimedia.org/r/286153 [13:20:52] 06Operations, 10MediaWiki-extensions-CodeReview, 10Traffic, 07HTTPS: Provide HTTPS links in CodeReview emails - https://phabricator.wikimedia.org/T31008#2251262 (10fgiunchedi) indeed this looks closable/declined to me, @brion @Krinkle @siebrand ? [13:28:02] PROBLEM - puppet last run on elastic1001 is CRITICAL: CRITICAL: Puppet has 1 failures [13:28:44] 06Operations, 10MediaWiki-extensions-CodeReview, 10Traffic, 07HTTPS: Provide HTTPS links in CodeReview emails - https://phabricator.wikimedia.org/T31008#2251282 (10yuvipanda) 05Open>03declined Let's do it! 
[13:29:07] (03PS1) 10Jcrespo: Prepare db2033 for jessie reimage [puppet] - 10https://gerrit.wikimedia.org/r/286155 [13:30:03] RECOVERY - puppet last run on elastic1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:30:27] Puppet issue above seems transient... having a look... [13:32:00] (03CR) 10Jcrespo: [C: 032] Prepare db2033 for jessie reimage [puppet] - 10https://gerrit.wikimedia.org/r/286155 (owner: 10Jcrespo) [13:32:03] Confirmed, that was a transient proxy error connecting to puppet master [13:39:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 654 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5141712 keys - replication_delay is 654 [13:39:45] (03PS4) 10Filippo Giunchedi: releases: use secret() for gpg keyring [puppet] - 10https://gerrit.wikimedia.org/r/285961 [13:39:48] !log reimaging db2033 [13:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:08] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2251289 (10elukey) Followed https://wikitech.wikimedia.org/wiki/Building_OpenJDK_8_backports On copper: ``` elukey@copper:~$ ls /var/cache/pbuilder/result/jes... [13:42:18] 06Operations, 10DBA: setup/install/deploy db2033 - https://phabricator.wikimedia.org/T122998#2251291 (10jcrespo) [13:43:08] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2251292 (10elukey) Next steps: 1) @ori, @elukey to evaluate memcached on mc2009 and mc1009 possibly. 2) if everything looks good, @elukey to double check with... 
[13:45:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5140877 keys - replication_delay is 0 [13:45:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] releases: use secret() for gpg keyring [puppet] - 10https://gerrit.wikimedia.org/r/285961 (owner: 10Filippo Giunchedi) [13:52:12] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: puppet fail [13:54:30] !log stopping mysql db2008 (cloning to db2033) [13:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:45] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [106250000.0] [14:02:01] (03CR) 10Nethahussain: "Moved Netha Hussain's blog from blogspot to wordpress. The enplanet link is not working now. The new link is https://nethahussain.wordpres" [puppet] - 10https://gerrit.wikimedia.org/r/102210 (owner: 10Nemo bis) [14:05:26] (03PS1) 10Filippo Giunchedi: Revert "releases: use secret() for gpg keyring" [puppet] - 10https://gerrit.wikimedia.org/r/286158 [14:06:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "releases: use secret() for gpg keyring" [puppet] - 10https://gerrit.wikimedia.org/r/286158 (owner: 10Filippo Giunchedi) [14:08:13] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:13:11] (03PS2) 10Hashar: beta: drop references to ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278883 [14:16:27] (03PS2) 10Hashar: beta: drop mobile cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283986 (https://phabricator.wikimedia.org/T130473) [14:19:20] (03PS1) 10Hashar: beta: fix purges for upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286159 [14:20:43] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [93750000.0] [14:21:32] (03CR) 10Hashar: 
[C: 032] beta: drop references to ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278883 (owner: 10Hashar) [14:21:39] (03CR) 10Hashar: [C: 032] beta: fix purges for upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286159 (owner: 10Hashar) [14:21:56] (03Merged) 10jenkins-bot: beta: drop references to ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278883 (owner: 10Hashar) [14:25:03] (03CR) 10Hashar: [C: 032] beta: drop mobile cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283986 (https://phabricator.wikimedia.org/T130473) (owner: 10Hashar) [14:25:03] (03Merged) 10jenkins-bot: beta: drop mobile cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283986 (https://phabricator.wikimedia.org/T130473) (owner: 10Hashar) [14:25:03] (03Merged) 10jenkins-bot: beta: fix purges for upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286159 (owner: 10Hashar) [14:26:53] !log Rebased tin:/srv/mediawiki-staging 31886c7..8e2670a . Bring in 3 changes that are solely for beta cluster. 
[14:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:30:42] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Puppet fails if we activate msgpack dispatching" [puppet] - 10https://gerrit.wikimedia.org/r/286141 (owner: 10Giuseppe Lavagetto) [14:30:54] (03PS2) 10Giuseppe Lavagetto: mediawiki::web: switch globally to use the worker mpm instead of prefork [puppet] - 10https://gerrit.wikimedia.org/r/286139 [14:32:56] !log restarting elasticsearch server elastic1020.eqiad.wmnet (T110236) [14:32:57] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [14:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:54] 06Operations, 10Math: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#2251364 (10Physikerwelt) 05Open>03declined Please open a new bug with a list of commands and their applications, which do not work with the new rendering mode. [14:36:30] (03PS4) 10Hashar: contint: clean up role::ci::slave [puppet] - 10https://gerrit.wikimedia.org/r/282322 [14:36:57] (03PS3) 10Hashar: contint: move npmtravis out of prod slave [puppet] - 10https://gerrit.wikimedia.org/r/282323 (https://phabricator.wikimedia.org/T114421) [14:37:15] 06Operations, 10Math: Install texlive-extra-utils on mw appservers - https://phabricator.wikimedia.org/T109195#2251386 (10Physikerwelt) [14:37:24] (03PS2) 10Jcrespo: Repool db2047 and db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286138 [14:37:26] (03PS1) 10Jcrespo: Depool db2008, db2009. 
Pool db2033 as the new x1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286163 (https://phabricator.wikimedia.org/T122998) [14:38:01] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web: switch globally to use the worker mpm instead of prefork [puppet] - 10https://gerrit.wikimedia.org/r/286139 (owner: 10Giuseppe Lavagetto) [14:38:55] (03CR) 10Hashar: "Compiled https://puppet-compiler.wmflabs.org/2637/" [puppet] - 10https://gerrit.wikimedia.org/r/282322 (owner: 10Hashar) [14:39:03] (03PS1) 10Dzahn: syslog: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/286164 [14:39:05] (03PS1) 10Dzahn: snapshot: one file per role class, move to modules/role [puppet] - 10https://gerrit.wikimedia.org/r/286165 [14:39:39] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2251391 (10Tnegrin) I haven't logged into anything except my MacBook in months so I think you're good to go! [14:39:59] !log oblivian@palladium conftool action : set/pooled=no; selector: name=mw1153.eqiad.wmnet [14:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:35] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [14:41:45] 06Operations, 10ops-eqiad, 06DC-Ops: eqiad: Rack and setup new labstore - https://phabricator.wikimedia.org/T133397#2251392 (10Cmjohnson) [X] - receive in normally [X] - rack [] - update racktables [] - label servers [X] - cabling [X] - setup bios/idrac/ilom [X] - add mgmt dns entries for both asset tag an... 
[14:41:47] 06Operations, 10ops-eqiad, 06DC-Ops: eqiad: Rack and setup new labstore - https://phabricator.wikimedia.org/T133397#2251393 (10Cmjohnson) [14:41:49] 06Operations, 10ops-eqiad, 06DC-Ops: eqiad: Rack and setup new labstore - https://phabricator.wikimedia.org/T133397#2251395 (10Cmjohnson) [14:41:51] 06Operations, 10ops-eqiad, 06DC-Ops: eqiad: Rack and setup new labstore - https://phabricator.wikimedia.org/T133397#2251397 (10Cmjohnson) [14:41:52] is one of those mine? [14:42:06] (the mediawiki-config) [14:44:03] jynus: yeah forgot to sync [14:44:04] blbl [14:44:51] 06Operations, 10ops-eqiad: eqiad: Failed DIMM db1065 - https://phabricator.wikimedia.org/T133250#2251403 (10jcrespo) @Cmjohnson Sorry, I think I completely forgot and ignored you last day. Next Tuesday? [14:45:12] I dont get scap sync-master, so I will just rebase on mira [14:45:27] (03PS3) 10Dzahn: admin: remove access for tnegrin pt1 [puppet] - 10https://gerrit.wikimedia.org/r/285898 (https://phabricator.wikimedia.org/T90932) [14:45:35] jynus: done [14:45:51] oh, not mine, I thought *I* was the one I forgot [14:45:54] (03CR) 10Dzahn: [C: 032] admin: remove access for tnegrin pt1 [puppet] - 10https://gerrit.wikimedia.org/r/285898 (https://phabricator.wikimedia.org/T90932) (owner: 10Dzahn) [14:46:23] (03PS1) 10Giuseppe Lavagetto: imagescaler: explicitly configure apache for mpm worker [puppet] - 10https://gerrit.wikimedia.org/r/286166 [14:46:32] the 3 unmerged ones on mira were mine for sure [14:47:16] I prepared some changes, and doing mysql stuff at the same time, so I forget sometime [14:47:17] s [14:47:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [14:47:31] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] switch configuration - https://phabricator.wikimedia.org/T133788#2251407 (10Papaul) @RobH can you lease check network switch settings again for me. all 3 servers are failing on the network configuration settings step. 
see message below. Thanks.... [14:47:45] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] switch configuration - https://phabricator.wikimedia.org/T133788#2251408 (10Papaul) 05Resolved>03Open [14:47:47] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2251409 (10Papaul) [14:48:34] (03PS2) 10Giuseppe Lavagetto: imagescaler: explicitly configure apache for mpm worker [puppet] - 10https://gerrit.wikimedia.org/r/286166 [14:48:58] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] imagescaler: explicitly configure apache for mpm worker [puppet] - 10https://gerrit.wikimedia.org/r/286166 (owner: 10Giuseppe Lavagetto) [14:49:19] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Verify maps caching - https://phabricator.wikimedia.org/T133988#2251410 (10Yurik) [14:52:08] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2251424 (10Dzahn) Gotcha! .. and done stat1001 - Notice: /Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/tnegrin]/ensure: removed bast1001 - No... 
[14:52:49] (03PS2) 10Dzahn: admin: remove access for tnegrin pt2 [puppet] - 10https://gerrit.wikimedia.org/r/285899 (https://phabricator.wikimedia.org/T90932) [14:54:03] !log moving topology of db2033 to be the new x1 master on codfw [14:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:14] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Verify maps caching - https://phabricator.wikimedia.org/T133988#2251425 (10Yurik) [14:54:42] (03CR) 10Dzahn: [C: 032] admin: remove access for tnegrin pt2 [puppet] - 10https://gerrit.wikimedia.org/r/285899 (https://phabricator.wikimedia.org/T90932) (owner: 10Dzahn) [14:56:43] !log oblivian@palladium conftool action : set/pooled=yes; selector: name=mw1153.eqiad.wmnet [14:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:49] 06Operations, 05Security: Define in Puppet or remove rogue user accounts not currently defined in admin/data.yaml - https://phabricator.wikimedia.org/T90923#2251429 (10Dzahn) [14:57:51] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 05Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#2251428 (10Dzahn) 05Open>03Resolved [15:00:33] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:02:29] checking --^ [15:03:34] just ran manually puppet agent -tv and everything worked, weird [15:04:30] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:04:49] <_joe_> elukey: what does /var/log/puppet.log tell you? 
[15:06:22] _joe_ a timeout occurred for a command [15:13:00] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:08] !log restarting elasticsearch server elastic1021.eqiad.wmnet (T110236) [15:17:12] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [15:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:37] 06Operations, 10Traffic: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442#2251455 (10BBlack) [15:22:07] (03PS1) 10Urbanecm: Creation of page mover userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) [15:23:04] (03PS2) 10Alex Monk: Creation of page mover userright on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:24:46] (03PS3) 10Jcrespo: Repool db2047 and db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286138 (https://phabricator.wikimedia.org/T132011) [15:24:50] (03PS3) 10Urbanecm: Creation of page mover userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) [15:25:06] (03CR) 10Jcrespo: [C: 032 V: 032] Repool db2047 and db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286138 (https://phabricator.wikimedia.org/T132011) (owner: 10Jcrespo) [15:25:16] (03PS2) 10Jcrespo: Depool db2008, db2009. 
Pool db2033 as the new x1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286163 (https://phabricator.wikimedia.org/T122998) [15:26:20] (03PS4) 10Urbanecm: Creation of page mover userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) [15:26:31] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2251477 (10chasemp) Does this usurp nobelium so we can decomission it in its labs support role? I assumed yes but want to make sure. Just a... [15:26:49] (03PS3) 10Jcrespo: Depool db2008, db2009. Pool db2033 as the new x1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286163 (https://phabricator.wikimedia.org/T122998) [15:26:51] (03PS5) 10Alex Monk: Creation of page mover userright on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:27:46] (03CR) 10Jcrespo: [C: 032] Depool db2008, db2009. Pool db2033 as the new x1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286163 (https://phabricator.wikimedia.org/T122998) (owner: 10Jcrespo) [15:29:05] (03CR) 10Urbanecm: "PS3 and PS4: Adding move permission to page mover user group (the first wasn't working)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:29:42] (03CR) 10Dzahn: "ERROR 1146 (42S02): Table 'phabricator_maniphest.project_column' doesn't exist" [puppet] - 10https://gerrit.wikimedia.org/r/286133 (owner: 10Aklapper) [15:29:49] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2047 and db2068. Depool db2008, db2009. Pool db2033 as the new x1 node. 
(duration: 00m 27s) [15:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:47] (03CR) 10Jforrester: [C: 04-1] "This commit title doesn't tell me which wiki it's for; it implicitly says it's for all wikis, which is Unhelpful™. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:36:54] 06Operations, 10DBA, 13Patch-For-Review: Reimage db2047 - check for hardware errors - https://phabricator.wikimedia.org/T132011#2251503 (10jcrespo) 05Open>03Resolved a:03jcrespo I have repooled the server, but feel free to still give it a second check and reopen if you see something weird. [15:37:08] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-puppetmaster puppet fails due to "Could not render to Puppet::Network::Format[msgpack]: undefined method `to_msgpack' for #" - https://phabricator.wikimedia.org/T133989#2251508 (10Krenair) [15:38:11] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:38:21] 06Operations, 10DBA, 13Patch-For-Review: setup/install/deploy db2033 - https://phabricator.wikimedia.org/T122998#2251523 (10jcrespo) [15:38:34] 06Operations, 10DBA, 13Patch-For-Review: setup/install/deploy db2033 - https://phabricator.wikimedia.org/T122998#1918926 (10jcrespo) 05Open>03Resolved Implemented as x1 node on codfw to be able to decom db2008 and db2009. [15:38:49] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Add RelEng to contint-roots - https://phabricator.wikimedia.org/T133990#2251528 (10thcipriani) [15:39:12] (03CR) 10Aklapper: [C: 04-1] "Urgh. :( Let me update my local instance..." [puppet] - 10https://gerrit.wikimedia.org/r/286133 (owner: 10Aklapper) [15:39:55] (03CR) 10Alex Monk: "I added the wiki to the commit message."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:40:00] (03CR) 10Urbanecm: "Sorry for bad message, Alex added it (thanks!)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:41:20] (03CR) 10Urbanecm: "By the way, in T133981's title this is called page mover, so I named it like this in conf." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:41:28] 07Puppet, 10Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2251546 (10mmodell) Generating the keys on the puppetmaster and distributing them as regular files would be easy enough. No need for exported resources. E.g. th... [15:41:36] 06Operations, 10DBA: Investigate/decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#2251547 (10jcrespo) [15:43:45] 06Operations, 10DBA: Investigate/decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#2251552 (10jcrespo) db2008 and db2009 are in theory still in use, but ready to be decommed as they have been substituted by the larger db2033. [15:44:06] 06Operations, 10OCG-General, 13Patch-For-Review, 05codfw-rollout: Use FQDNs instead of hostnames in the download urls sent to Mediawiki - https://phabricator.wikimedia.org/T133864#2251553 (10cscott) I thought about that last time (after I'd submitted the patch) and came to the conclusion that it wouldn't.... [15:44:30] (03CR) 10Jforrester: "The community get to choose the i18n of the name, not the config file name. 
This is meant to be useful for devs/opsen reading the config, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:47:36] (03PS1) 10Jcrespo: Retire db2008 and db2009 as x1 nodes [puppet] - 10https://gerrit.wikimedia.org/r/286172 (https://phabricator.wikimedia.org/T125827) [15:48:48] (03CR) 10Alex Monk: "True. Are we going to set this up in WikimediaMessages or just give the enwiki sysops a list of pages to create?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:49:02] (03CR) 10jenkins-bot: [V: 04-1] Retire db2008 and db2009 as x1 nodes [puppet] - 10https://gerrit.wikimedia.org/r/286172 (https://phabricator.wikimedia.org/T125827) (owner: 10Jcrespo) [15:55:45] (03PS2) 10Jcrespo: Retire db2008 and db2009 as x1 nodes [puppet] - 10https://gerrit.wikimedia.org/r/286172 (https://phabricator.wikimedia.org/T125827) [15:57:49] (03CR) 10Luke081515: [C: 04-1] "Just some formal stuff.." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [16:01:09] (03PS6) 10Urbanecm: Creation of page mover userright on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) [16:07:49] (03CR) 10Urbanecm: "Spaces have been added, name of the group has been changed to extendedmover." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [16:08:04] (03PS7) 10Jforrester: Creation of "extendedmover" user right on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [16:08:14] (03CR) 10Jforrester: [C: 031] Creation of "extendedmover" user right on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [16:08:31] (03CR) 10Jforrester: "List of pages to create is sufficient." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [16:14:33] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2243191 (10GWicke) The reason RESTBase (and many other Cassandra users) are using RAID-0 or JBOD is that it tends to provide more resilience and throughput at a given data duplication rati... [16:18:42] gehel, for some reason it feels like maps are running slower than usual [16:19:11] gehel, https://phabricator.wikimedia.org/T133988 [16:19:14] yurik: and you'd like me to have a look? [16:19:23] gehel, can you? [16:19:32] just to see if its something obvious [16:19:50] I can always have a look ... [16:19:50] we could of course ping bblack :D [16:20:22] I know mostly nothing about maps, or varnish, but let me see if I spot something obvious. [16:20:39] Do you just have a feeling that it is slow? Or actual metrics? [16:22:03] from here, I can download the tile that you linked in < 30ms, not incredibly slow... [16:22:11] (03CR) 10Alex Monk: "WikimediaMessages just holds Wikimedia-specific messages for translation and cross-wiki use, like our custom groups and stuff." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [16:22:15] lemme do a more scientific measure of that... [16:22:31] !log restarting elasticsearch server elastic1022.eqiad.wmnet (T110236) [16:22:32] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:03] (03Abandoned) 10Chad: Move varnish errorpage.html into its module [puppet] - 10https://gerrit.wikimedia.org/r/283577 (owner: 10Chad) [16:23:09] (03CR) 10Urbanecm: "Something like translatewiki.net but only for WMF sites? Thanks for your explanation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [16:23:52] (03CR) 10Alex Monk: "No, something like every other MediaWiki extension that gets translations from translatewiki.net" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [16:23:52] gehel, its more of a thing with getting the whole map - sometimes it just sits there for multiseconds, with white squares, and I don't know if its just me [16:23:59] i'm pretty far from equiad [16:24:03] eqiad [16:24:25] you *should* get a cache closer to you, probably esams [16:30:39] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2251618 (10Eevans) > The reason RESTBase (and many other Cassandra users) are using RAID-0 or JBOD is that it tends to provide more resilience and throughput at a given data duplication ra... 
[16:32:07] (03PS32) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [16:32:54] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2251619 (10GWicke) On the other hand, losing one of only three machines is a larger blast radius than losing one of five or so, which when using RAID-0 cost about the same. [16:36:03] gehel, my bad, it is esams [16:36:05] https://phabricator.wikimedia.org/T133988 [16:40:44] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2251633 (10BBlack) = 24H Results: | Set | H/1 | Both | SPDY | H/2 | |--|--|--|--|--| | **All** | 54.75% | 33.17% | 7.07% | 5.01% | | **Text** | 57.58% | 28.72% | 6.99% | 6.72% | | **Up... [16:45:59] (03PS1) 10Cmjohnson: Adding production and mgmt dns entries for aqs1004-7 [dns] - 10https://gerrit.wikimedia.org/r/286176 [16:46:23] (03CR) 10jenkins-bot: [V: 04-1] Adding production and mgmt dns entries for aqs1004-7 [dns] - 10https://gerrit.wikimedia.org/r/286176 (owner: 10Cmjohnson) [16:47:35] (03PS2) 10Cmjohnson: Adding production and mgmt dns entries for aqs1004-7 [dns] - 10https://gerrit.wikimedia.org/r/286176 [16:51:51] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2251644 (10thcipriani) [16:53:31] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2251659 (10BBlack) While the 24H data is much better quality (not so subject to daily regional highs and lows), the overall picture is still basically the same. There's a lot of inter... 
[16:54:36] (03CR) 10Cmjohnson: [C: 032] Adding production and mgmt dns entries for aqs1004-7 [dns] - 10https://gerrit.wikimedia.org/r/286176 (owner: 10Cmjohnson) [16:55:28] gehel, when you get tile, how many varnish hits do you see in the response headers? [16:55:32] x-cache header [16:56:37] I'm actually looking at the metrics we have (https://graphite.wikimedia.org/S/BV) and I see codfw being much faster than other DCs, which I understand as "we don't have a good cache hit rate" [16:56:47] !log restarting elasticsearch server elastic1023.eqiad.wmnet (T110236) [16:56:47] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [16:56:50] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] switch configuration - https://phabricator.wikimedia.org/T133788#2251665 (10RobH) So in checking the switch config, I see the following: ge-1/0/2 up up restbase2007 default-switch private1-b-codfw 2018... [16:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:20] papaul: ^ the restbase2007 host network settings are all correct [16:57:52] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Verify maps caching - https://phabricator.wikimedia.org/T133988#2251667 (10Gehel) An active measurement on 800 requests give me a 99%-ile = 180 ms. Not amazing, but not incredibly slow either. If I understand our metrics correctly (https://g... [16:57:53] so i imagine its a setting on those HPs, since they are all in different rows? [16:58:08] you need to see if their nic bios is set to allow pxe perhaps, but ive not seen that error before. [16:58:32] i'd try it again and see if perhaps they were working on dhcp stuff when you tried it initially? 
[16:59:05] yurik: X-Cache: cp2015 miss(0), cp1046 hit(1), cp3006 hit(2), cp3006 frontend hit(886) [16:59:15] yurik: not entirely sure what that means [17:01:49] robh: i am able to pxe boot [17:02:04] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Verify maps caching - https://phabricator.wikimedia.org/T133988#2251670 (10Gehel) a:03Gehel [17:02:06] robh: i csn see the image loading up [17:02:11] So that is in the installer for debian? [17:02:22] Well, if you can PXE boot, it means the network was already right [17:02:28] so this is an issue in the installer it seems [17:02:47] because dhcp worked to hand you a lease to load the image [17:03:13] (i realize its complaining of dhcp in the installer as well, but since it worked for pxe and intial boot, we can likely narrow scope to installer issues) [17:06:19] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2215958 (10RobH) [17:06:21] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] switch configuration - https://phabricator.wikimedia.org/T133788#2251695 (10RobH) 05Open>03Resolved after irc chat, it seems this is only happening in the installer, and thus isnt an issue with netowrk config. I'd suggest disucsion on the setup... [17:06:38] i cannot type today, so many typos. [17:06:43] Robh: ok [17:06:56] papaul: try it again today just to see if its still happening [17:07:09] i got the impression that daniel and faidon were working on installer stuff yesterday, and i have no idea of the scope [17:07:29] unless that was today? (comment is today, so likely it is) [17:07:32] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Add RelEng to contint-roots - https://phabricator.wikimedia.org/T133990#2251528 (10JanZerebecki) The service has no restart defined that is better than start and stop, but it should be added to the sud... 
[17:07:39] robh: i did again today at 10:00am [17:07:42] ahh, ok [17:07:55] and that happens on all three is very strange, lemme dig into all their settings [17:08:01] yes [17:08:03] all 3 [17:08:09] same error [17:08:12] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: add nodepool restart to contint-admins - https://phabricator.wikimedia.org/T133990#2251707 (10JanZerebecki) [17:14:51] papaul: im rebooting restbase2007 to watch the rror as well [17:15:01] cuz yea, thats odd, ill have to push out the installer logs and parse throught htem [17:15:22] so 2007 will be posting (dont be surprised to see it cycle) [17:15:29] robh:ok [17:15:38] the reboot seems to have a lot more feedback on this HP [17:15:44] (percentages of post progress) [17:15:57] yes it does [17:16:04] the first batch i worked on was just a blank screen during that, its a nice improvement [17:16:21] likely just need to flash the older ones in the gen9 line. [17:16:34] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Verify maps caching - https://phabricator.wikimedia.org/T133988#2251738 (10BBlack) There are a number of misunderstandings in this ticket, so let me step through them a bit, and then we can get back to the basics and ask whatever fundamental... [17:16:36] speeding up post and giving feedback is pretty major [17:16:55] hrmm, i get media cable failure for one of the pxe attempts. [17:16:56] robh: there is also a great improvement on their ilo web GUI [17:17:05] litsk its trying to use the second nic at first [17:17:06] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Verify maps caching - https://phabricator.wikimedia.org/T133988#2251739 (10BBlack) p:05Triage>03Normal [17:17:08] yes it does that and than boots [17:17:14] that may be causing issues [17:17:17] did it do that on other HPs? 
[17:17:26] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Questions about map tile cache performance - https://phabricator.wikimedia.org/T133988#2251410 (10BBlack) [17:17:26] yes [17:17:38] rephrase: have you seen that on HPs that install successfully? [17:17:49] cuz it may be borking the installer, trying to use the unused nic port. [17:18:00] yep [17:18:02] thats whats happening [17:18:13] its failing to see connection link in installer... odd [17:18:43] bblack, wow, thanks for the explanation!!! [17:21:45] papaul: So I see both a 4 port 1GBE NIC, and a second NIC added in with no info [17:21:54] do these have 10Gbit nic as well? [17:22:05] 2 10 G [17:22:17] but we aren't usign them in the new systems at this time? [17:22:24] no [17:22:34] we are using 1G [17:22:40] the first one like we do [17:23:32] HP FlexFabric 10Gb 2port 534FLR-SFP+ Adapter yep see it in the device inventory [17:23:36] with a status of unknown, which is odd [17:23:52] lemme check and see what the other new restbase have. [17:25:02] robh: ok [17:25:27] so restbase2006 (older one) doesnt have 10GB, but has unknown state for nic1, which is useless [17:25:42] 2007 and 2008 both have 10gbit card and show unknown, so likely not helping much [17:26:42] papaul: can you hop on 2008 and disable the 10GBit card in bios? [17:26:50] faster via crash cart than serial redirection =] [17:27:04] lets try disabling the secondary nic entirely and see what happens in the installer [17:27:09] i will when i get back in the DC i am on lunch break [17:27:16] oh, no rush, lemme try to do remotely then! [17:27:46] robh:ok [17:28:19] (03CR) 10Aklapper: "Why does it think we are in the "phabricator_maniphest" DB? 
It's never mentioned in the .sh.erb file and the "phabricator_project" DB is d" [puppet] - 10https://gerrit.wikimedia.org/r/286133 (owner: 10Aklapper) [17:33:57] so i set the add on nic to not pxe boot [17:34:07] it was set to allow it, dunno if its enough, i dont see an option to disable the port entirely [17:34:31] robh:ok [17:34:33] ahh, found it [17:34:44] so i set it to not pxe boot first, going to see if that fixes it [17:34:51] cuz i rather leave the nic enabled if needed later. [17:34:58] if that doesnt work, i'll disable the add on 10Gbe nic entirely [17:36:18] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2251772 (10fgiunchedi) >>! In T133785#2250716, @fgiunchedi wrote: >>>! In T133785#2249043, @Ottomata wrote: >> Ok! We discussed partitioning today. We'd like the following: >> >> - / a... [17:36:18] basically i want to do the minimum amount of changes, so if we want to use it in the future, we dont have to go back into bios to enable it (if possible) [17:36:41] but the fact it was trying to boot from it first, rather than the 1GBE, was the indicator if something being off [17:36:57] indicator of something being off even. [17:38:36] so it boots right in post now, doesnt try and fail on other ports, loading image. [17:38:55] aude: I'll start to deploy the backport to wmf.22 [17:39:30] bah, failing same issue in the os, rebooting to disable nic in bios [17:40:12] aude: will you be around for that? [17:41:35] jzerebecki: ok [17:41:39] robh: the issue is the installer can not find the right nic to configured right? [17:41:51] papaul: thats what i think, i think the installer is failing over to the 10GBE [17:42:05] since we dont need to have both run at the same time, i'm going to disable it in bios and see if it fixes the issue [17:42:18] system just sees 10GB as the primary when its installed it seems. 
[17:42:26] (which makes sense, since typically it would be preferred) [17:42:34] robh:ok [17:43:01] im changing it on 2007 for now, if that fixes it, i'll let you know and leave the rest to you to toggle them off (so we both know how) [17:43:20] robh: ok [17:45:57] !log restarting elasticsearch server elastic1024.eqiad.wmnet (T110236) [17:45:58] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [17:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:06] papaul: that seemed to fix it, its now loading components and has passed the network configuration stage of the installer [17:48:18] so you want to drop into bios, system config, enable/disable pci devices [17:48:31] and disable the 10gb card on restbase200[89] [17:48:42] i'll finish the install on this, but not sign any keys [17:49:12] * robh doesnt forget to fix boot order [17:49:17] darn hps and lack of one time boot via cmd [17:49:35] ok [17:49:39] well, there is another problem... [17:49:44] doesnt see disks in installer... [17:49:57] papaul: or does this not have the ssds installed yet? [17:50:21] if you are eating this can wait for you to finish your lunch [17:50:22] robH ssds are installed [17:50:26] =] [17:50:34] odd, ok, will go into bios and poke about [17:55:27] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Questions about map tile cache performance - https://phabricator.wikimedia.org/T133988#2251792 (10Yurik) @bblack, awesome explanation, thank you, this clarifies so much! I am observing a very slow load time on the landline connection, while... [17:56:00] Any op working? [17:56:22] partying, not working [17:57:04] ? [17:57:19] jynus_do_not_tru: uh, yes? [17:57:32] bblack, on mobile [17:57:40] Or robh [17:58:20] I'm not sure what you are asking? 
If the question was 'are you working on mobile' the answer is nope [17:58:32] Please run for me "Start all slaves;" on dbstore2001 [17:58:51] Or alert in a few hours [17:58:58] you're not logged into nickserv or anything [17:59:03] yeahhhh [17:59:06] I kow [17:59:27] heh [17:59:29] so how do i know its you and not just some random person? your nick even says do not trust? [17:59:41] this seems like the kind of thing someone would socially engineer to test opsen =P [17:59:44] M3 s2 warning [17:59:54] M3: Structured data section on Commons - https://phabricator.wikimedia.org/M3 [18:00:03] Only me would kknow taht [18:00:04] can you use hangouts on your phone? [18:00:09] Sure [18:00:56] well, its legit that those are errors [18:01:37] Robh ping me on hangout [18:02:22] pinged, just confirm its you in irc via hangout and i'll just be your remote hands =] [18:02:47] id confirmed, you are you. [18:02:49] ok, logging in now [18:05:41] done [18:06:57] jzerebecki: did we deploy yet? [18:07:11] !log started all slaves via dbstore2002 per jaime's request [18:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:23] aude: submodule update done, now deploying it [18:07:37] ok [18:08:34] ok, fourth try to get into bios on this hp system [18:08:41] so annoying it lacks the 'boot into bios' like dell. [18:08:47] from the command line (that i can see) [18:09:10] unless one of the boot order numbers is bios, likely is [18:11:19] wasn't it dbstore2001? [18:12:10] ahh fuck me [18:12:13] jynus_do_not_tru: ^ [18:12:25] well, start all slaves isnt bad when they were running, but seems i was on wrong host [18:12:39] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Questions about map tile cache performance - https://phabricator.wikimedia.org/T133988#2251848 (10BBlack) There's still a lot of missing detail here. What browser/os/version is this? How do I reproduce the same page load? What else is at t... 
[18:12:55] !log jzerebecki@tin Synchronized php-1.27.0-wmf.22/extensions/Wikidata/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php: wmf.22 fc20c54f7915b94ec0d15ef17e207c116910623d 1 of 2 T133924 (duration: 00m 44s) [18:12:56] T133924: Wikidata dump is missing entities - https://phabricator.wikimedia.org/T133924 [18:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:14] !log started all slaves via dbstore2001 this time. [18:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:25] so checked with jaime via hangout, no need to stop them on 2002, already broken there. [18:14:27] !log jzerebecki@tin Synchronized php-1.27.0-wmf.22/extensions/Wikidata/extensions/Wikibase/repo/includes/Hooks/OutputPageBeforeHTMLHookHandler.php: wmf.22 fc20c54f7915b94ec0d15ef17e207c116910623d 2 of 2 T132645 (duration: 00m 34s) [18:14:28] T132645: [Bug] fatal error: Argument 1 passed to CachingEntityRevisionLookup::getEntityRevision() must be an instance of EntityId, null given - https://phabricator.wikimedia.org/T132645 [18:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:14:42] aude: done [18:14:45] ok [18:15:06] (03PS1) 10Catrope: Re-enable cross-wiki beta feature in labs for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286191 [18:15:27] (03CR) 10Catrope: [C: 032] Re-enable cross-wiki beta feature in labs for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286191 (owner: 10Catrope) [18:15:56] did you update the submodule? [18:16:05] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Questions about map tile cache performance - https://phabricator.wikimedia.org/T133988#2251861 (10BBlack) Also, note this line in your original landline ping results: ``` 6. RT.TC2.AMS.NL.retn.net 3.2% 250 126.4 129.4 96.3 2965... 
[18:16:15] (03Merged) 10jenkins-bot: Re-enable cross-wiki beta feature in labs for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286191 (owner: 10Catrope) [18:16:21] trying to view https://test.wikidata.org/w/index.php?title=Special:Undelete&target=Q2322&timestamp=20160418012907 doesn't work for me :( [18:16:48] latest thing i see on tin for Wikidata is Date: Wed Apr 13 23:11:45 2016 +0200 [18:16:50] aude: yes here https://gerrit.wikimedia.org/r/#/c/286189/ and branch https://gerrit.wikimedia.org/r/#/c/286190/ [18:16:54] on tin? [18:17:20] y [18:17:30] hmm [18:18:20] jzerebecki: can i try? [18:19:13] aude: you are right i forgot the submodule update *sigh* [18:19:24] ok [18:19:28] it happens [18:20:27] !log jzerebecki@tin Synchronized php-1.27.0-wmf.22/extensions/Wikidata/extensions/Wikibase/repo/includes/Dumpers/DumpGenerator.php: wmf.22 fc20c54f7915b94ec0d15ef17e207c116910623d 1 of 2 T133924 (duration: 00m 29s) [18:20:28] T133924: Wikidata dump is missing entities - https://phabricator.wikimedia.org/T133924 [18:21:10] !log jzerebecki@tin Synchronized php-1.27.0-wmf.22/extensions/Wikidata/extensions/Wikibase/repo/includes/Hooks/OutputPageBeforeHTMLHookHandler.php: wmf.22 fc20c54f7915b94ec0d15ef17e207c116910623d 2 of 2 T132645 (duration: 00m 28s) [18:21:10] T132645: [Bug] fatal error: Argument 1 passed to CachingEntityRevisionLookup::getEntityRevision() must be an instance of EntityId, null given - https://phabricator.wikimedia.org/T132645 [18:21:17] aude: please try again [18:21:21] ok, much better :) [18:21:22] thanks [18:21:26] thanks [18:22:50] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: add nodepool restart to contint-admins - https://phabricator.wikimedia.org/T133990#2251871 (10hashar) Nodepool has the ability to dump a stack dump. One sends `SIGUSR2` and that spurts trace to `/var/l...
[18:27:48] papaul: so im haing issues seeing the disks in bios, i'll let you fix the other two and see if they have the same sisue [18:27:49] issue [18:28:35] rohb:working on it [18:28:45] robh: thanks [18:29:12] im actually going to run out down the street to snag some lunch myself in a few minutes [18:30:43] ok [18:32:22] so once they are booting into the installer correctly, it failed to see disks [18:32:27] we need to put the idsks into a jbod type setup [18:32:52] ok [18:33:17] its just a sas controller though [18:33:21] i just checked the order [18:33:46] ok, afk for a short bit, back shortly [18:34:33] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:36:04] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:49:18] https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors is now officially useless [19:05:51] How do I report a Wikipedia violator? [19:06:23] I am not sure how this works - anyone there to help? [19:06:47] saracnations: define 'wikipedia violator'? [19:07:23] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Questions about map tile cache performance - https://phabricator.wikimedia.org/T133988#2251965 (10Yurik) Chrome 50.0.2661.86 (Official Build) (64-bit) on Ubuntu 16.04. I used https://maps.wikimedia.org/#9/50.7060/-100.3725 for both tests, on... [19:07:24] !log restarting elasticsearch server elastic1025.eqiad.wmnet (T110236) [19:07:25] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:05] Hey, question, is X-SPAM set up to send spam alerts from an person's own email address? 
[19:08:23] someone who has motives to continue to remove long standing citations in articles and make accusations against a well established media outlets with some type of motive? [19:08:49] saracnations: sounds like an issue to report at the local village pump or admin noticeboard? [19:09:05] I don't know how to do that, how do I do that? [19:09:11] and thank you [19:09:42] saracnations: is this for the english wikipeida? [19:09:47] Yes English [19:11:02] saracnations: I think https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Incidents is the right place to report that [19:11:23] let me look at it - one second and thanks [19:12:32] so just starting a conversation on that page gets who exactly to look at it? [19:12:48] saracnations: administrators for the english wikipedia [19:13:52] ok so I just basically start a new paragraph and include the user in it with " == == " and go from there? [19:14:08] I think so, yes. [19:14:20] thank you for your prompt replies and guidance [19:20:37] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Questions about map tile cache performance - https://phabricator.wikimedia.org/T133988#2251990 (10BBlack) Refreshing after a long pause has to re-establish a connection. If you're comparing to google, then trace your pings to whatever edge I... [19:20:46] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 625 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5162510 keys - replication_delay is 625 [19:24:12] (03CR) 10Dzahn: [C: 031] "nevermind, wrong db on my side. the query works in phabricator_project. 
549 rows" [puppet] - 10https://gerrit.wikimedia.org/r/286133 (owner: 10Aklapper) [19:24:35] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5135457 keys - replication_delay is 0 [19:29:11] !log restarting elasticsearch server elastic1026.eqiad.wmnet (T110236) [19:29:12] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [19:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:24] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Questions about map tile cache performance - https://phabricator.wikimedia.org/T133988#2251998 (10Yurik) 05Open>03Invalid @bblack, thanks for looking into this. Google's servers are also ipv6, and their ping response is around 10.5, which... [19:38:15] (03PS3) 10Dzahn: List Phabricator projects with only a single workboard column [puppet] - 10https://gerrit.wikimedia.org/r/286133 (owner: 10Aklapper) [19:43:34] (03CR) 10Dzahn: [C: 032] List Phabricator projects with only a single workboard column [puppet] - 10https://gerrit.wikimedia.org/r/286133 (owner: 10Aklapper) [19:49:53] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: Questions about map tile cache performance - https://phabricator.wikimedia.org/T133988#2252027 (10BBlack) 05Invalid>03Resolved Yes, at 10ms that probably means the gmaps endpoint you're hitting is inside of Russia, which is completely dif... 
[19:56:14] !log catrope@tin Synchronized php-1.27.0-wmf.22/extensions/CentralNotice/: T133971 (duration: 00m 41s) [19:56:15] T133971: CentralNotice: choiceData RL module hashes are flapping - https://phabricator.wikimedia.org/T133971 [19:56:22] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [106250000.0] [19:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:33] !log (Re)starting cleanup on restbase1009-{a,b}.eqiad.wmnet [19:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:32] (03PS1) 10Dzahn: planet: update feed URL for Netha Hussain [puppet] - 10https://gerrit.wikimedia.org/r/286209 (https://phabricator.wikimedia.org/T133987) [20:02:35] 06Operations, 10RESTBase, 10RESTBase-Cassandra, 10cassandra, 13Patch-For-Review: automated invocation of Cassandra repair jobs - https://phabricator.wikimedia.org/T92355#2252047 (10Eevans) [20:08:26] (03PS2) 10Mobrovac: Changeprop: Set hyperswitch as the start-up module [puppet] - 10https://gerrit.wikimedia.org/r/286153 [20:15:35] 06Operations, 10RESTBase-Cassandra, 10cassandra: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590#2252082 (10Eevans) [20:17:03] (03PS2) 10Dzahn: planet: update feed URL for Netha Hussain [puppet] - 10https://gerrit.wikimedia.org/r/286209 (https://phabricator.wikimedia.org/T133987) [20:17:57] (03CR) 10Dzahn: [C: 032] planet: update feed URL for Netha Hussain [puppet] - 10https://gerrit.wikimedia.org/r/286209 (https://phabricator.wikimedia.org/T133987) (owner: 10Dzahn) [20:19:22] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [93750000.0] [20:21:00] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - 
https://phabricator.wikimedia.org/T95253#2252092 (10Eevans) [20:22:18] hmm.. much wasted screen estate [20:23:37] (03CR) 10Mobrovac: "See Iaea347eb56448e36fc7372f839aee41d7bb0f368" [puppet] - 10https://gerrit.wikimedia.org/r/285678 (https://phabricator.wikimedia.org/T133221) (owner: 10Ppchelko) [20:24:57] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra, 13Patch-For-Review: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2252097 (10Eevans) [20:25:31] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra, 13Patch-For-Review: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2000325 (10Eevans) [20:25:53] (03Abandoned) 10Ppchelko: Set up redlinks processing in change propagation [puppet] - 10https://gerrit.wikimedia.org/r/285678 (https://phabricator.wikimedia.org/T133221) (owner: 10Ppchelko) [20:30:01] 06Operations, 10RESTBase, 10RESTBase-Cassandra, 10cassandra, 13Patch-For-Review: Automated invocation of Cassandra repair jobs - https://phabricator.wikimedia.org/T92355#2252107 (10Eevans) [20:31:17] 06Operations, 10RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2252111 (10Eevans) [20:31:45] 06Operations, 10RESTBase-Cassandra, 10cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2007841 (10Eevans) [20:34:22] 06Operations, 10RESTBase, 10RESTBase-Cassandra, 10cassandra, and 2 others: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#2252128 (10Eevans) [20:35:29] (03PS1) 10Alex Monk: [WIP] Move puppet repository cherrypick counter to diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) [20:35:37] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2252133 (10Eevans) [20:39:41] 06Operations, 
06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2252155 (10Gehel) Damn, this labs thing is confusing... @chasemp thanks for the precisions. I'm not entirely sure I understand what you mean b... [20:40:00] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Move puppet repository cherrypick counter to diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) (owner: 10Alex Monk) [20:43:20] (03PS2) 10Alex Monk: [WIP] Move puppet repository cherrypick counter to diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) [20:43:42] 06Operations, 10RESTBase-Cassandra: service cassandra-b fails on restbase2004 - https://phabricator.wikimedia.org/T132999#2252160 (10Eevans) 05Open>03Resolved a:03Eevans I think this must have just been an Icinga snafu, the over optimistic use of an expiring acknowledgement, or somesuch. The instance in... [20:44:26] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Move puppet repository cherrypick counter to diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) (owner: 10Alex Monk) [20:45:36] (03PS3) 10Alex Monk: [WIP] Move puppet repository cherrypick counter to diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) [20:50:33] 06Operations, 10RESTBase-Cassandra: service cassandra-b fails on restbase2004 - https://phabricator.wikimedia.org/T132999#2252190 (10Dzahn) Fine with me, but i fail to see how "cassandra[11544]: Exception encountered during startup: " can be an icinga snafu. 
[20:57:29] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2252240 (10hashar) [20:57:31] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-puppetmaster puppet fails due to "Could not render to Puppet::Network::Format[msgpack]: undefined method `to_msgpack' for #" - https://phabricator.wikimedia.org/T133989#2252236 (10hashar) 05Open>03Resolved a:03hasha... [20:59:15] !log restarting elasticsearch server elastic1027.eqiad.wmnet (T110236) [20:59:16] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [20:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:59:24] 06Operations: Remove unused Samba packages - https://phabricator.wikimedia.org/T132915#2252250 (10Dzahn) aah! that explains, thanks [21:03:28] PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: Puppet has 1 failures [21:23:11] (03PS1) 10ArielGlenn: don't try to preserve owner/group for kiwix local mirror [puppet] - 10https://gerrit.wikimedia.org/r/286240 [21:24:31] (03CR) 10ArielGlenn: [C: 032] don't try to preserve owner/group for kiwix local mirror [puppet] - 10https://gerrit.wikimedia.org/r/286240 (owner: 10ArielGlenn) [21:30:19] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:44:44] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2252471 (10Eevans) [22:00:15] PROBLEM - check_apache2 on payments2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [22:01:05] ^^^ payments2002 .... 
looking [22:05:10] PROBLEM - check_apache2 on payments2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [22:10:10] RECOVERY - check_apache2 on payments2002 is OK: PROCS OK: 6 processes with command name apache2 [22:10:10] RECOVERY - check_apache2 on payments2003 is OK: PROCS OK: 6 processes with command name apache2 [22:12:29] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252547 (10MF-Warburg) [22:14:41] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252547 (10MF-Warburg) [22:22:04] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252715 (10Dzahn) [22:24:37] 07Blocked-on-Operations, 10MediaWiki-Database, 13Patch-For-Review, 07Schema-change: Change pp_sortkey from float to double - https://phabricator.wikimedia.org/T107323#2252729 (10Danny_B) [22:31:35] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252741 (10Krenair) @jcrespo: Hey. Is everything necessary ready for this? Could you arrange the replication? [22:34:29] (03PS1) 10Dzahn: add new project language Jamaican (jam) [dns] - 10https://gerrit.wikimedia.org/r/286248 (https://phabricator.wikimedia.org/T134017) [22:37:11] (03PS2) 10Dzahn: add new project language Jamaican (jam) [dns] - 10https://gerrit.wikimedia.org/r/286248 (https://phabricator.wikimedia.org/T134017) [22:38:23] a new wikipedia? [22:39:23] yes [22:39:26] ah [22:39:46] poke me if you need help with any of the wikidata parts of creating it [22:40:56] (03CR) 10Dzahn: [C: 032] "Pump up the jam, pump it up" [dns] - 10https://gerrit.wikimedia.org/r/286248 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn) [22:43:09] as Reedy pointed out.. 
they waited 6 years for it [22:57:34] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-puppetmaster puppet fails due to "Could not render to Puppet::Network::Format[msgpack]: undefined method `to_msgpack' for #" - https://phabricator.wikimedia.org/T133989#2252795 (10Krenair) I had to uninstall the ruby-msg... [22:57:45] !log DNS - forced authdns-gen-zones etc from https://phabricator.wikimedia.org/T97051#1994679 on ns0/ns1/ns2 to get new language added [22:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:01:03] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252547 (10Dzahn) Due to T97051 i had to run the commands from T97051#1994679 to force re-creation of the zones. But now it has been added: https:... [23:01:22] there it is, new an shiny [23:01:24] https://jam.wikipedia.org [23:10:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave] [23:11:48] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:13:08] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:13:29] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING [23:14:59] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical [23:15:47] (03PS1) 10Dzahn: add jamwiki to langlist, InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) [23:18:14] (03PS2) 10Dzahn: add jamwiki to langlist, InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) [23:20:45] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2252871 (10Dzahn) Do you have a project logo file ready? We need to upload it. (to /static/images/project-logos/) Could you paste the localized nam... [23:27:12] It's in the description mutante ... [23:28:01] (03PS4) 10Alex Monk: [WIP] Move puppet repository cherrypick counter to diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) [23:28:11] Krenair: found it. "Diskoshan" hehe [23:28:36] Wikipidia diskoshan [23:29:24] and the logo too.. 
duh [23:29:33] how do those usually get into /static/ [23:30:34] (03PS3) 10Dzahn: add jamwiki to langlist, InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) [23:31:11] think you have to take the right thumb and put it through optipng [23:32:10] ah [23:36:29] (03PS4) 10Dzahn: add jamwiki to langlist, InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) [23:36:56] done, 135x155 thumb and ran through optipng [23:44:57] (03PS5) 10Alex Monk: Move puppet repository cherrypick counter to diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/286226 (https://phabricator.wikimedia.org/T132997) [23:56:28] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:39] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:39] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:39] PROBLEM - Hadoop DataNode on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:39] PROBLEM - dhclient process on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:49] PROBLEM - Disk space on Hadoop worker on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:09] PROBLEM - Hadoop NodeManager on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:09] PROBLEM - configured eth on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:19] PROBLEM - salt-minion processes on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:58] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:59] PROBLEM - Disk space on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:58:19] PROBLEM - Check size of conntrack table on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.