[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T0000).
[00:00:26] aude: no :( still in the process of diagnosing the log explosion (I think)
[00:00:51] :/
[00:01:04] just wondering if any of it is at all related to wikidata
[00:01:16] https://phabricator.wikimedia.org/T147520 isn't very informative
[00:01:27] RoanKattouw: 314452 live on mw1099
[00:01:55] Thanks, testing
[00:02:09] aude: nothing related to wikidata afaik
[00:02:42] !log cache_maps: rolling depooled frontend restarts for libvmod-netmapper upgrade
[00:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:02:50] aude: AaronSchulz may be able to give more details about the errors he's been tracking down.
[00:03:11] Hmm, it doesn't seem to be working
[00:03:13] !log Created Flow tables on labswiki (wikitech.wikimedia.org)
[00:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:03:31] createExtensionTables.php for Flow seems to work
[00:04:13] Oh, wait, I only cherry-picked to wmf21
[00:04:39] * Dereckson nods
[00:04:43] And today's train was rolled back, so wikidata is on wmf20
[00:04:51] Let's see if I can test this on testwiki
[00:05:00] so only testable on group
[00:05:54] Dereckson: Yup, working
[00:06:03] we also need it for wmf.20?
[00:06:10] Not sure
[00:06:21] But let's not backport it there for now
[00:06:27] ok
[00:06:43] Matt should be able to investigate on testwiki, and wmf21 will hopefully roll out to other wikis soon
[00:07:33] ack'ed, syncing to prod
[00:08:00] Thanks
[00:08:22] !log dereckson@tin Synchronized php-1.28.0-wmf.21/extensions/Flow/: Make more types of exceptions loggable ([[Gerrit:314452]], T135545, T138310) (duration: 01m 12s)
[00:08:24] T135545: When "default" is changed on the json page in Gadgets definitions space, it is not reflected on the Special:Gadgets page - https://phabricator.wikimedia.org/T135545
[00:08:24] T138310: Flow as a Beta feature: enable, disable and reenable doesn't seem to work - https://phabricator.wikimedia.org/T138310
[00:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:13] everyone done deploying?
[00:10:14] So SWAT is done.
[00:10:16] twentyafterfour: yes
[00:10:20] hah nice timing
[00:10:48] I am about to update phabricator, didn't want to take it offline at a bad time
[00:11:20] !log scheduled phabricator update starting momentarily. service will be offline for (hopefully) less than 5 minutes
[00:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:23] thcipriani: thanks
[00:15:26] !log cache_upload: rolling depooled frontend restarts for libvmod-netmapper upgrade
[00:15:27] aude: :) thanks for checking on wikidata, appreciated.
[00:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:15:35] !log phabricator update complete and service is restored
[00:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:18:54] Operations, Phabricator (2016-10-05), Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2695556 (mmodell) Open>Resolved a:mmodell
[00:47:57] Thanks, RoanKattouw, Dereckson.
[00:49:45] You're welcome.
[00:51:44] (CR) Dzahn: [C: 2] Make nginx optional in aptly class [puppet] - https://gerrit.wikimedia.org/r/312562 (owner: 20after4)
[00:51:51] (PS9) Dzahn: Make nginx optional in aptly class [puppet] - https://gerrit.wikimedia.org/r/312562 (owner: 20after4)
[00:54:54] Operations, Phabricator (2016-10-05), Upstream: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2695646 (Dzahn) confirmed the reference field is back and works, thank you very much :)
[00:55:39] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:58:42] (CR) Dzahn: "confirmed no-op on toolsbeta-aptly-server-01, mwv-apt-01, deployment-tin, deployment-mira, mw1017" [puppet] - https://gerrit.wikimedia.org/r/312562 (owner: 20after4)
[01:00:13] twentyafterfour: ^ and thanks for the fixed RT search
[01:03:24] Platonides: https://phabricator.wikimedia.org/T143138 they justified the projectcom subdomain, that's fine for you?
[01:04:41] mutante: you're welcome :)
[01:04:46] thcipriani: https://gerrit.wikimedia.org/r/#/c/314461/
[01:07:39] (PS1) Legoktm: Enable magic links regardless of MediaWiki core default [mediawiki-config] - https://gerrit.wikimedia.org/r/314463 (https://phabricator.wikimedia.org/T147536)
[01:11:00] (PS1) Dereckson: Add ecwikimedia to the list of private wikis [puppet] - https://gerrit.wikimedia.org/r/314465 (https://phabricator.wikimedia.org/T135521)
[01:12:46] (CR) Dzahn: Configuration for Aphlict (5 comments) [puppet] - https://gerrit.wikimedia.org/r/313937 (https://phabricator.wikimedia.org/T112765) (owner: 20after4)
[01:14:28] (PS1) Dereckson: Activate ec.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/314466 (https://phabricator.wikimedia.org/T135521)
[01:18:24] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[01:19:55] (CR) Dzahn: [C: 1] "yep, that's the ISO Alpha-2 code for Ecuador" [dns] - https://gerrit.wikimedia.org/r/314466 (https://phabricator.wikimedia.org/T135521) (owner: Dereckson)
[01:21:18] (CR) 20after4: Configuration for Aphlict (3 comments) [puppet] - https://gerrit.wikimedia.org/r/313937 (https://phabricator.wikimedia.org/T112765) (owner: 20after4)
[01:22:58] (PS1) Dereckson: Sort by alphabetical order wikimedia-chapter Apache sites [puppet] - https://gerrit.wikimedia.org/r/314469
[01:24:04] (PS1) Dereckson: Add ec.wikimedia.org to Apache configuration [puppet] - https://gerrit.wikimedia.org/r/314470 (https://phabricator.wikimedia.org/T135521)
[01:25:14] (CR) Dzahn: Configuration for Aphlict (1 comment) [puppet] - https://gerrit.wikimedia.org/r/313937 (https://phabricator.wikimedia.org/T112765) (owner: 20after4)
[01:26:44] *waves*, bbl
[01:28:21] aude: private chapter wikis should get the Wikidata client?
[01:29:14] (I imagine not)
[01:41:22] Dereckson: probably not
[01:41:36] i suppose arbitrary access would work, though
[01:43:30] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:44:30] Yes, they could theoretically have some use. For example, Commons uses Wikidata to automate some l10n templates.
[01:47:48] best to not enable yet, but could be possible in the future
[01:54:12] (PS1) Dereckson: Initial configuration for ec.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314471 (https://phabricator.wikimedia.org/T135521)
[02:08:46] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:11:51] o/
[02:11:55] So, $dayjob. In a delightful fit of ironies, their infrastructure is best described as "fucking disaster" and I am now the lead in "make things clean, well organized, and done right"
[02:12:08] So I brought lessons learned at the WMF in. Puppet, with manifests in git and proper code review.
[02:13:45] Made a neat trick you guys might like: I have the puppetmaster have /two/ checked out trees, one which is HEAD, and the other which pulls in a 'production' tag (both from master). Use environdir in puppet.conf to point at the directory containing both.
[02:14:17] git has a post-receive that pulls from both places. So merges on master reflect instantly in $environdir/staging
[02:14:37] And from the clients, you can puppet agent --environment=staging and pull from /that/
[02:14:55] puppet agent defaults to $dir/production, which is tagged.
[02:16:00] When you have tested and are happy with the manifest, just move the git tag to the good revision and *bam*, $dir/production gets the version you were happy with.
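The two-tree setup described above can be sketched roughly as follows. This is only an illustration, not the speaker's actual hook: the function names, the `origin` remote and the example directory are assumptions, and Python stands in for whatever the real post-receive script is written in. The sketch builds the git commands that would refresh a `staging` checkout at the tip of master and a `production` checkout at the movable `production` tag.

```python
import os
import subprocess
from pathlib import PurePosixPath


def checkout_commands(environment_dir, branch="master", tag="production"):
    """Build the git commands that refresh both checked-out trees.

    Hypothetical helper: `environment_dir` is the directory holding the
    `staging` and `production` checkouts, each assumed to be a clone
    with a remote named `origin`.
    """
    root = PurePosixPath(environment_dir)
    targets = [
        ("staging", f"origin/{branch}"),     # follows the tip of master
        ("production", f"refs/tags/{tag}"),  # pinned to the movable tag
    ]
    commands = []
    for name, ref in targets:
        tree = str(root / name)
        # --force lets a tag that was moved to a new revision be updated.
        commands.append(["git", "-C", tree, "fetch", "--force", "--tags", "origin"])
        commands.append(["git", "-C", tree, "checkout", "--force", "--detach", ref])
    return commands


def run_commands(commands):
    """Run the commands with GIT_DIR removed from the environment.

    Git sets GIT_DIR when it invokes hooks, which would otherwise point
    every command at the repository that received the push.
    """
    env = {k: v for k, v in os.environ.items() if k != "GIT_DIR"}
    for argv in commands:
        subprocess.run(argv, check=True, env=env)
```

A post-receive hook could then call `run_commands(checkout_commands("/etc/puppet/environments"))`, where the path is a placeholder for whatever directory puppet.conf points at. Agents would select a tree with `puppet agent --environment=staging`, as in the log.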
[02:16:55] Next step is to add some glue to add environ when you make a branch in the repo, so you can run the agent against that explicitly while you work on the new thing without fucking with anything else.
[02:17:42] (Also, you can just put 'environment=foo' in the client puppet.conf on dev boxen so it sticks while it's WIP)
[02:18:01] paravoid: Thought you might like that ^^
[02:31:58] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 11m 53s)
[02:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:24] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 15m 38s)
[03:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:14:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 6 03:14:25 UTC 2016 (duration 7m 1s)
[03:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:33:41] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:58:50] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[04:56:22] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3053891 keys - replication_delay is 0
[06:20:29] Operations, ops-eqiad, DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2695959 (Marostegui) Hi, The disk is still being rebuilt: ``` root@db1055:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -aALL Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 76% in 952 Minut...
[06:36:49] !log reimaging mw1187, mw1188, mw1211 to jessie (the latter is a scap proxy)
[06:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:38:43] !log reimaging mw1208 and mw1221 to Debian Jessie (API appservers)
[06:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:49:00] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:54:49] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:02:14] (PS1) Muehlenhoff: Update to 1.0.2i [debs/openssl] - https://gerrit.wikimedia.org/r/314504
[07:04:06] (CR) Muehlenhoff: [C: 2] Update to 1.0.2i [debs/openssl] - https://gerrit.wikimedia.org/r/314504 (owner: Muehlenhoff)
[07:07:05] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:09:44] (CR) Jcrespo: "This is in the good direction, but needs some extra changes. I can do them myself when I have a more stable internet connection." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: Paladox)
[07:12:00] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[07:13:39] Operations, Prometheus-metrics-monitoring: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476#2696001 (jcrespo) @godog Do you want me to handle the upgrade? If you upload the new package, I can handle the puppet config changes (I want to enable SHOW ENGINE INNODB STATUS and do t...
[07:13:56] Operations, Prometheus-metrics-monitoring: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476#2696005 (jcrespo) @fgiunchedi ^
[07:20:23] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:22:03] (CR) Paladox: "@Jcrespo what do you mean in the sql file?" [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: Paladox)
[07:22:58] (PS3) Paladox: Create a phabricator_stopwords phabricator table in sql (innodb) [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673)
[07:24:19] (PS4) Paladox: Create a phabricator_stopwords phabricator table in sql (innodb) [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673)
[07:28:26] (CR) Paladox: "I guess it's this http://stackoverflow.com/questions/13579810/how-to-import-data-from-text-file-to-mysql-database ?" [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: Paladox)
[07:32:47] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:33:57] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:40:04] !log Dropping tables in S1.enwiki - T57676
[07:40:05] T57676: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676
[07:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:43:00] !log upgrading labtestvirt2001 to Linux 4.4
[07:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:50:33] (CR) Jcrespo: "This is exactly what we need, but it requires a manual restart after deploy. Will handle it later." [puppet] - https://gerrit.wikimedia.org/r/314465 (https://phabricator.wikimedia.org/T135521) (owner: Dereckson)
[07:51:17] Operations, DBA: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2696041 (Marostegui) Table in S1 has been deleted.
[07:57:18] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[08:00:17] (PS2) Giuseppe Lavagetto: scap_source: use one provider, pass "origin" as a parameter [puppet] - https://gerrit.wikimedia.org/r/314295
[08:00:19] (PS2) Giuseppe Lavagetto: scap_source: enforce the origin url [puppet] - https://gerrit.wikimedia.org/r/314296 (https://phabricator.wikimedia.org/T143692)
[08:01:43] (CR) Volans: [C: 1] "LGTM. To be on the safe side you could run a full puppet compiler (with the new smaller catalogs should not take too much time/space) to e" [puppet] - https://gerrit.wikimedia.org/r/312206 (https://phabricator.wikimedia.org/T147403) (owner: Giuseppe Lavagetto)
[08:02:41] <_joe_> volans: https://puppet-compiler.wmflabs.org/4209/\
[08:02:44] !log Dropping tables in S3.testwiki - T57676
[08:02:45] T57676: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676
[08:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:03:10] (CR) Giuseppe Lavagetto: [C: 2] scap_source: use one provider, pass "origin" as a parameter [puppet] - https://gerrit.wikimedia.org/r/314295 (owner: Giuseppe Lavagetto)
[08:03:53] _joe_: I saw that one from filippo, but was only on 3 hosts of the same cluster ;)
[08:06:55] (CR) Giuseppe Lavagetto: [C: 2] scap_source: enforce the origin url [puppet] - https://gerrit.wikimedia.org/r/314296 (https://phabricator.wikimedia.org/T143692) (owner: Giuseppe Lavagetto)
[08:08:43] (PS1) Muehlenhoff: Re-enable HHVM Icinga checks for jessie [puppet] - https://gerrit.wikimedia.org/r/314507
[08:11:50] <_joe_> oh, sigh :(
[08:14:25] (PS1) Giuseppe Lavagetto: scap::sources: fix repository for changeprop [puppet] - https://gerrit.wikimedia.org/r/314508
[08:15:18] (CR) Giuseppe Lavagetto: [C: 1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/314507 (owner: Muehlenhoff)
[08:15:34] (CR) Giuseppe Lavagetto: [C: 2] scap::sources: fix repository for changeprop [puppet] - https://gerrit.wikimedia.org/r/314508 (owner: Giuseppe Lavagetto)
[08:16:07] (PS5) Paladox: Create a phabricator_stopwords phabricator table in sql (innodb) [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673)
[08:16:26] (PS6) Paladox: Create a phabricator_stopwords phabricator table in sql (innodb) [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673)
[08:17:18] (CR) Paladox: Create a phabricator_stopwords phabricator table in sql (innodb) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: Paladox)
[08:25:11] !log restarted hhvm on mw1213
[08:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:26:48] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 70681 bytes in 0.143 second response time
[08:26:59] !log Restarted apache on iridium to apply hotfix to phab calendar form. refs T147525
[08:27:01] T147525: "Create calendar event" form has broken default date+time values, times out when trying to use date picker widget - https://phabricator.wikimedia.org/T147525
[08:27:02] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.051 second response time
[08:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:34:31] Operations, DBA: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2696076 (Marostegui) Table in S3 has been deleted. I believe this ticket can be closed. Looks like email_capture isn't present in any other shard.
[08:34:35] (PS1) Giuseppe Lavagetto: role::deployment::server: create $base_path variable [puppet] - https://gerrit.wikimedia.org/r/314511
[08:42:58] Operations, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2696081 (elukey) Another idea! The downside of adding special rules in the main httpd config file imo...
[08:46:30] (CR) Giuseppe Lavagetto: [C: 2] role::deployment::server: create $base_path variable [puppet] - https://gerrit.wikimedia.org/r/314511 (owner: Giuseppe Lavagetto)
[08:53:11] (PS1) Giuseppe Lavagetto: role::icinga: define facilities on just one host [puppet] - https://gerrit.wikimedia.org/r/314512
[08:53:26] !log reimaging mw1212-mw1214 to jessie
[08:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:53:34] <_joe_> moritzm: ^^
[08:54:11] !log jessie dist-upgrade on cp* cache hosts
[08:54:14] <_joe_> moritzm: I'm pretty sure that shameful patch will solve the problem
[08:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:55:03] (CR) Giuseppe Lavagetto: [C: 2] role::icinga: define facilities on just one host [puppet] - https://gerrit.wikimedia.org/r/314512 (owner: Giuseppe Lavagetto)
[08:56:08] ack, looks like it should fix it
[09:05:05] Operations, Prometheus-metrics-monitoring: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476#2696087 (fgiunchedi) @jcrespo sounds good to me, thanks! I've uploaded 0.9.0 both to Debian official and internally for jessie/trusty. It should be available for upgrade now everywhere....
[09:06:08] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[09:06:27] looking ^
[09:08:47] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:08:50] (PS1) Muehlenhoff: noc: Also use HHVM on jessie [puppet] - https://gerrit.wikimedia.org/r/314514
[09:08:58] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:11:08] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openssl]
[09:12:58] ema: it's you running dist-upgrade, /var/lib/dpkg/lock makes puppet fail ^^^
[09:13:29] sorry was supposed to be questions :)
[09:13:38] s/questions/a question/
[09:13:48] yes that's the upgrade
[09:15:08] yeah, the Icinga check for dpkg is a little edgy, it sometimes also fails when "apt-get update" is running...
[09:15:52] this was just plain puppet failure
[09:15:55] Operations, ops-eqiad, Patch-For-Review: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307#2696109 (fgiunchedi) I see lithium still stuck at ``` Scanning for devices. Please wait, this may take several minutes... ``` so likely a reseat or sth like that...
[09:18:58] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:19:12] (PS1) Marostegui: db-eqiad.php: Restoring normal weight [mediawiki-config] - https://gerrit.wikimedia.org/r/314515 (https://phabricator.wikimedia.org/T145533)
[09:19:19] volans: just a transient failure, running puppet again fixed it
[09:19:38] yeah I was expecting nothing less :-P
[09:20:11] (CR) Marostegui: [C: 2] db-eqiad.php: Restoring normal weight [mediawiki-config] - https://gerrit.wikimedia.org/r/314515 (https://phabricator.wikimedia.org/T145533) (owner: Marostegui)
[09:20:36] (Merged) jenkins-bot: db-eqiad.php: Restoring normal weight [mediawiki-config] - https://gerrit.wikimedia.org/r/314515 (https://phabricator.wikimedia.org/T145533) (owner: Marostegui)
[09:22:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restoring db1082 original weight: 500 (duration: 00m 52s)
[09:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:24:59] Operations, ops-codfw, ops-eqiad, media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#2511148 (Marostegui) @fgiunchedi I would like to get db1082 upgraded as Moritz mentioned in: T145533 So far I have repooled it as it has been out for a...
[09:25:58] Operations, DBA, Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2633433 (Marostegui) For now I have restored its original value until we agreed on when we can upgrade it. So far it has been behaving fine since it crashed around a month ago.
[09:27:42] Operations, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2696123 (elukey) >>! In T109226#1543444, @ori wrote: > The root cause rests with the interaction of th...
[09:28:00] (PS1) Muehlenhoff: Debian moved back to firefox, stop using iceweasel [puppet] - https://gerrit.wikimedia.org/r/314516
[09:32:30] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[09:36:47] !log adding mw1208 and mw1221 back to the api appservers live pool
[09:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:40:40] Operations, ops-codfw, ops-eqiad, media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#2696141 (fgiunchedi) sure @Marostegui ! Once you have the firmware from the links above for the right controller (`hpssacli controller all show`) you c...
[09:41:24] (CR) Volans: "First pass, a couple of comments." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/314247 (owner: Ema)
[09:43:58] (PS1) Giuseppe Lavagetto: role::icinga: hack to fix icinga config [puppet] - https://gerrit.wikimedia.org/r/314517
[09:44:30] (CR) Giuseppe Lavagetto: [C: 2 V: 2] role::icinga: hack to fix icinga config [puppet] - https://gerrit.wikimedia.org/r/314517 (owner: Giuseppe Lavagetto)
[09:45:25] (PS1) Muehlenhoff: elasticsearch: Extend version check to also apply to jessie [puppet] - https://gerrit.wikimedia.org/r/314518
[09:55:08] !log installing jackrabbit security updates on Ubuntu and Debian systems
[09:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:37] !log cp1046 cp2015 depooled reboot for kernel upgrades
[09:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:59:36] (PS1) Elukey: Update the HHVM version for X-Powered-By (static websites) [puppet] - https://gerrit.wikimedia.org/r/314519
[10:00:25] (CR) Elukey: "Do we need to keep this header? :)" [puppet] - https://gerrit.wikimedia.org/r/314519 (owner: Elukey)
[10:02:47] !log reimaging mw122[23] to Debian jessie (api appservers)
[10:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:03:07] PROBLEM - Host cp2015 is DOWN: PING CRITICAL - Packet loss = 100%
[10:04:55] cp1046 came back online fine, cp2015 didn't ^
[10:05:38] (PS7) Paladox: Create a phabricator_stopwords phabricator table in sql (innodb) [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673)
[10:06:29] (PS1) Giuseppe Lavagetto: naggen2: do not output duplicate resources [puppet] - https://gerrit.wikimedia.org/r/314522
[10:06:38] <_joe_> moritzm: ^^
[10:06:46] <_joe_> still needs to be tested, but that's the idea
[10:09:27] !log power cycling cp2015, reboot failed
[10:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:11:03] RECOVERY - Host cp2015 is UP: PING OK - Packet loss = 0%, RTA = 36.63 ms
[10:11:29] (CR) Volans: [C: -1] naggen2: do not output duplicate resources (1 comment) [puppet] - https://gerrit.wikimedia.org/r/314522 (owner: Giuseppe Lavagetto)
[10:13:11] (CR) Giuseppe Lavagetto: naggen2: do not output duplicate resources (1 comment) [puppet] - https://gerrit.wikimedia.org/r/314522 (owner: Giuseppe Lavagetto)
[10:14:03] (PS2) Giuseppe Lavagetto: naggen2: do not output duplicate resources [puppet] - https://gerrit.wikimedia.org/r/314522
[10:16:56] !log reimaging mw1209, mw1210, mw1215 to jessie
[10:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:18:39] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/314522 (owner: Giuseppe Lavagetto)
[10:22:23] (PS3) Giuseppe Lavagetto: naggen2: do not output duplicate resources [puppet] - https://gerrit.wikimedia.org/r/314522
[10:26:29] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:27:10] !log cache_maps: rolling reboots for kernel upgrades
[10:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:28:29] <_joe_> so, the naggen patch just repairs part of the issue, not all of it
[10:28:43] <_joe_> so let's attack this the hacky way for now
[10:29:01] <_joe_> I mean the naggen patch will fix part of the problem, so let's do it.
[10:29:01] what's missing?
[10:29:08] <_joe_> volans: service definitions
[10:29:14] <_joe_> they'll be duplicated
[10:29:16] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/314522 (owner: Giuseppe Lavagetto)
[10:29:27] <_joe_> let's merge anyways
[10:29:36] (CR) Giuseppe Lavagetto: [C: 2] naggen2: do not output duplicate resources [puppet] - https://gerrit.wikimedia.org/r/314522 (owner: Giuseppe Lavagetto)
[10:32:52] (CR) Dereckson: [C: 1] "ISBN and RFC are sensible, https://en.wikipedia.org/wiki/Wikipedia:PMID documents the use of PMID, https://en.wikipedia.org/wiki/Help:Magi" [mediawiki-config] - https://gerrit.wikimedia.org/r/314463 (https://phabricator.wikimedia.org/T147536) (owner: Legoktm)
[10:33:39] I'm not sure if it's related to Ops but https://wikimedia.de/ is down
[10:34:45] Amir1: that is hosted by WMDE apparently. Maybe reach #wikidata ?
[10:34:47] Operations, Performance-Team, Thumbor, Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2696206 (fgiunchedi) @gilles good question, I don't think we have a good way to pull metrics from logs yet. I was meaning to try https://git...
[10:35:03] hashar: they all are in meeting (including me)
[10:35:03] :D
[10:35:34] let's do it at the end of the meeting
[10:35:43] Hello.
[10:36:03] Amir1: wikimedia.de isn't managed by WMF.
[10:36:26] okay, noted down thanks
[10:40:38] PROBLEM - HHVM processes on mw1188 is CRITICAL: NRPE: Command check_hhvm not defined
[10:42:02] Operations, ops-eqiad, Patch-For-Review: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307#2696210 (Cmjohnson) @fgiunchedi The disks are fine, the bios sees them correctly and during this morning's attempt to install Jessie, I was able to see the offer/req...
[10:42:27] <_joe_> ok icinga config is refreshed
[10:42:31] (PS2) Dereckson: Add ec.wikimedia.org to Apache configuration [puppet] - https://gerrit.wikimedia.org/r/314470 (https://phabricator.wikimedia.org/T135521)
[10:42:33] (PS2) Dereckson: Sort by alphabetical order wikimedia-chapter Apache sites [puppet] - https://gerrit.wikimedia.org/r/314469
[10:42:36] <_joe_> still need to fix the double-checks
[10:43:17] (PS3) Dereckson: Apache configuration for pt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832)
[10:44:17] (CR) Dereckson: "PS3: rebased against Ib173c88f and I209e566c53 to avoid merge conflict." [puppet] - https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[10:44:42] silencing the HHVM check, that's fallout of the stale icinga config (and indirectly fixed by https://gerrit.wikimedia.org/r/#/c/314507/)
[10:44:44] (CR) jenkins-bot: [V: -1] Apache configuration for pt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[10:44:54] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[10:45:14] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:46:55] (PS4) Dereckson: Apache configuration for pt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832)
[10:47:11] (CR) Dereckson: "PS4: redirects.conf refreshed" [puppet] - https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[10:48:01] Hi Urbanecm.
[10:50:24] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:50:52] !log change-prop deploying 403eec8
[10:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:51:33] RECOVERY - MegaRAID on db1055 is OK: OK: optimal, 1 logical, 2 physical
[10:51:34] Operations, ops-eqiad, DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2696224 (Marostegui) The rebuilt process finished successfully this time - so it was indeed the disk as you said: ``` Device Present ================ Virtual Drives : 1...
[10:51:46] Operations, ops-eqiad, DBA: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2696225 (Marostegui) Open>Resolved
[10:52:28] Operations, Performance-Team, Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2696226 (fgiunchedi) IIRC it was pretty obvious/frequent when it happened so I suppose we'd be seeing it by now already. Anyways I used the following commands to test a hit/miss and it seem...
[10:59:22] Operations, Mail, OTRS: OTRS spam classification methods and systems - https://phabricator.wikimedia.org/T146968#2676399 (pajz) Now, I can't say anything definite given the relevant servers are operated by the WMF, so I suppose only they'd be able to provide perfectly up-to-date information, but let...
[11:00:13] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1289
[11:03:04] (PS4) Dereckson: New 'engineer' group for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/308448 (https://phabricator.wikimedia.org/T144599) (owner: MarcoAurelio)
[11:04:04] (CR) Dereckson: [C: 1] New 'engineer' group for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/308448 (https://phabricator.wikimedia.org/T144599) (owner: MarcoAurelio)
[11:05:13] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 163557 Threads: 1 Questions: 31493578 Slow queries: 1315 Opens: 2034 Flush tables: 1 Open tables: 587 Queries per second avg: 192.554 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:07:50] PROBLEM - NTP on cp4020 is CRITICAL: NTP CRITICAL: Offset unknown
[11:12:30] !log added mw122[23] back to the api appservers live pool
[11:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:14:10] Operations, ops-eqiad: dbstore1001: check drive bays - https://phabricator.wikimedia.org/T145389#2696252 (Cmjohnson) Open>Resolved
[11:14:34] !log reimaging mw122[34] to Debian Jessie
[11:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:16:13] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:19:52] Anyway, it's up now :D
[11:21:22] RECOVERY - NTP on cp4020 is OK: NTP OK: Offset -0.001669406891 secs
[11:22:34] (PS7) MarcoAurelio: Initial configuration for olo.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612)
[11:22:43] (PS3) MarcoAurelio: Labs configuration for olo.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/312812 (https://phabricator.wikimedia.org/T146612)
[11:24:04] Operations, ops-eqiad, DBA: db1065: Degraded RAID - https://phabricator.wikimedia.org/T147396#2691693 (Cmjohnson) Disk has been requested through the Dell portal. Confirmed: Request 937313705 was successfully submitted. Your service request has been successfully created and will be reviewed by our...
[11:24:50] (PS1) Giuseppe Lavagetto: role::icinga: declare common resources only on the primary server [puppet] - https://gerrit.wikimedia.org/r/314537
[11:26:34] (CR) MarcoAurelio: "> Code ok, but theres the consensus question, see task." [mediawiki-config] - https://gerrit.wikimedia.org/r/313601 (https://phabricator.wikimedia.org/T147063) (owner: MarcoAurelio)
[11:27:06] (PS2) MarcoAurelio: Enable Extension:ShortURL on bd.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014)
[11:29:17] PROBLEM - MariaDB Slave SQL: s2 on db1063 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bgwiki.hitcounter doesnt exist on query. Default database: bgwiki. Query: [snipped]
[11:29:17] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bgwiki.hitcounter doesnt exist on query. Default database: bgwiki. Query: [snipped]
[11:29:33] That is me - I will get that fixed
[11:30:33] marostegui: need help?
[11:31:15] I wonder why that failed - nothing should be using it
[11:31:19] I am recreating it now
[11:32:00] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4019_v4, cp4019_v6
[11:32:01] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp4019_v4, cp4019_v6
[11:32:08] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4019_v4, cp4019_v6
[11:33:21] ema: ^^^
[11:34:12] volans: yep I was looking into this. cp4019 is rebooting
[11:35:01] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[11:35:02] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK
[11:35:29] db1063 is now fixed
[11:35:32] Operations, ops-eqiad, Patch-For-Review: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307#2696271 (fgiunchedi) Indeed that's odd @Cmjohnson I can see the dhcp offers from _both_ cr1 and cr2 in eqiad coming in a roughly the same time ``` Oct 6 11:33:10 ca...
[11:35:33] going to fix dbstore1002
[11:35:42] ok
[11:36:12] Looks like the code tries a delete from with a table that is empty and should not be used :_(
[11:36:42] did you restart any DB recently?
[11:36:48] nop [11:36:49] !log restbase deploy start of fa4dc79 [11:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:36:57] I had something like that, trying to remember when/where [11:37:09] marostegui: recently last few days [11:37:40] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp3006_v4, cp3006_v6 [11:37:50] RECOVERY - MariaDB Slave SQL: s2 on db1063 is OK: OK slave_sql_state Slave_SQL_Running: Yes [11:37:58] dbstore1002 is now fixed [11:38:00] same, cp3006 is rebooting ^ [11:38:33] volans: not really no - certainly not these two (db1063 and dbstore1002) [11:38:59] (03PS1) 10Dereckson: Logo update for pt.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314539 (https://phabricator.wikimedia.org/T126832) [11:41:11] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK [11:41:23] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [11:44:17] (03CR) 10Dereckson: [C: 04-1] "Transparent background issue." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/314539 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson) [11:45:01] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [11:50:38] !log rebooting video scalers for kernel security update [11:50:39] (03PS3) 10ArielGlenn: openstack: Update monitor_labs_salt_keys.py for new Nova API version [puppet] - 10https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) (owner: 10Alex Monk) [11:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:51:49] (03CR) 10ArielGlenn: [C: 032] openstack: Update monitor_labs_salt_keys.py for new Nova API version [puppet] - 10https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) (owner: 10Alex Monk) [11:52:47] 06Operations, 10scap, 13Patch-For-Review, 03Scap3, and 2 others: Scap::server::sources is out of sync with the repositories actually present on tin/mira - https://phabricator.wikimedia.org/T143692#2696285 (10Joe) So, apart from servermon, which points to a (still) inexistent url, all other services that us... 
[11:52:58] 06Operations, 10scap, 13Patch-For-Review, 03Scap3, and 2 others: Scap::server::sources is out of sync with the repositories actually present on tin/mira - https://phabricator.wikimedia.org/T143692#2696286 (10Joe) 05Open>03Resolved [11:57:15] !log restbase deploy end of fa4dc79 [11:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:00:20] (03PS7) 10Giuseppe Lavagetto: hiera: always search for the full key [puppet] - 10https://gerrit.wikimedia.org/r/312206 (https://phabricator.wikimedia.org/T147403) [12:01:13] (03PS2) 10Dereckson: Logo for pt.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314539 (https://phabricator.wikimedia.org/T126832) [12:04:06] (03PS3) 10Dereckson: Logo for pt.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314539 (https://phabricator.wikimedia.org/T126832) [12:05:21] (03CR) 10Dereckson: "PS2: fix transparency issue using ImageMagick 6.8 instead of 6.9." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314539 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson) [12:10:20] (03CR) 10Giuseppe Lavagetto: "I decided to split the patch in two parts, for avoiding race conditions." [puppet] - 10https://gerrit.wikimedia.org/r/312206 (https://phabricator.wikimedia.org/T147403) (owner: 10Giuseppe Lavagetto) [12:14:34] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:25] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:22:53] (03PS2) 10Dereckson: Disable Upload Wizard blacklist issues on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312735 (https://phabricator.wikimedia.org/T146417) [12:23:49] (03PS2) 10Dereckson: Configure Visual Editor namespaces on sv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309808 (https://phabricator.wikimedia.org/T144688) [12:33:34] !log adding mw122[45] back to the live api appservers pool (note: mw1224 was pooled => no before the reimage, but I don't see any blocker in adding it back to serve live traffic) [12:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:36:25] (03PS1) 10Cmjohnson: Removing dns entries for snapshot 1002/3/4 T141762 [dns] - 10https://gerrit.wikimedia.org/r/314541 [12:36:46] !log reimaging mw122[67] to Debian Jessie [12:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:38:03] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for snapshot 1002/3/4 T141762 [dns] - 10https://gerrit.wikimedia.org/r/314541 (owner: 10Cmjohnson) [12:38:19] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:39:32] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: decommission snapshot1002, 1003, 1004 - https://phabricator.wikimedia.org/T141762#2696316 (10Cmjohnson) [12:45:44] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:47:22] (03PS1) 10Elukey: Decommission the old AQS cluster [puppet] - 10https://gerrit.wikimedia.org/r/314542 (https://phabricator.wikimedia.org/T147461) [12:50:07] !log cache_misc: rolling reboots for kernel upgrades [12:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:54:12] hashar: it's a full house for eu swat today :) [13:00:04] addshore, hashar, 
anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T1300). [13:00:04] mobrovac, dcausse, Dereckson, and Dereckson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:23] =o [13:00:37] o/ [13:02:25] zeljkof: I think hashar might still be out shopping? :D [13:02:30] Hi. [13:02:35] I can SWAT this morning. [13:02:47] zeljkof: oh you were going to? [13:03:05] Dereckson: no no, take it :) [13:03:22] ok [13:03:25] I was just about to ask who is in mood for swat [13:03:36] mobrovac: ping? [13:03:47] (03PS3) 10Dereckson: Initialize subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314255 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [13:03:56] i'm here [13:04:21] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314255 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [13:04:24] (03CR) 10Elukey: [C: 031] Re-enable HHVM Icinga checks for jessie [puppet] - 10https://gerrit.wikimedia.org/r/314507 (owner: 10Muehlenhoff) [13:04:47] (03Merged) 10jenkins-bot: Initialize subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314255 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [13:06:26] dcausse: live on mw1099 [13:06:53] Dereckson: it affects only main scripts, is it ok to run mwscript on mw1099 for a quick test? 
[13:07:20] dcausse: yes [13:07:22] ok [13:09:23] Dereckson: all good [13:10:37] ok [13:11:55] !log dereckson@tin Synchronized tests/cirrusTest.php: Initialize subphrases autocomplete on wikisources, mw.org and wikitech (T146208, 1/3, no-op in prod part) (duration: 00m 50s) [13:11:56] T146208: Enable sub-phrase completion suggester on wikitech, mediawiki.org and wikisource - https://phabricator.wikimedia.org/T146208 [13:12:02] Dereckson: thanks! [13:12:28] zeljkof: hey, look https://gerrit.wikimedia.org/r/#/c/314255, this is a patch where order matters [13:12:44] zeljkof: CirrusSearch-common.php uses a new variable, defined in InitialiseSettings.php [13:12:56] so it's one of the cases where we need to sync IS first, then the file using it afterwards [13:14:04] Dereckson: thnx for the merge [13:14:10] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initialize subphrases autocomplete on wikisources, mw.org and wikitech (T146208, 2/3) (duration: 00m 49s) [13:15:02] o/ [13:15:10] back, I was doing some grocery shopping [13:15:48] Hi hashar. I sympathize, was at the bank at 14:50. [13:15:59] !log dereckson@tin Synchronized wmf-config/CirrusSearch-common.php: Initialize subphrases autocomplete on wikisources, mw.org and wikitech (T146208, 3/3) (duration: 00m 49s) [13:16:11] Dereckson: looking... [13:16:27] and despite the correct sync order, we still see in the logs [13:16:28] 11 Notice: Undefined variable: wmgCirrusSearchCompletionSuggesterSubphrases in /srv/mediawiki/wmf-config/CirrusSearch-common.php on line 477 [13:16:41] :/ [13:16:42] (but only 11, when order isn't respected it's more like 3000) [13:16:53] ok :) [13:18:41] mobrovac: live on mw1099 [13:20:01] Dereckson: the patch is about deferred updates, so can't really test it on one host [13:20:13] Dereckson: tested in beta earlier and all looked good [13:20:21] so you can continue SWATting safely [13:20:42] And logs don't report anything wrong.
[13:20:54] yup [13:22:10] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 and stat1004 for nschaaf - https://phabricator.wikimedia.org/T146924#2696365 (10schana) @Nuria verified [13:23:18] (03PS3) 10Dereckson: Disable Upload Wizard blacklist issues on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312735 (https://phabricator.wikimedia.org/T146417) [13:23:19] !log dereckson@tin Synchronized php-1.28.0-wmf.21/extensions/EventBus: Send a resource_change event on page_image property change (T145569) (duration: 00m 48s) [13:23:21] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2696368 (10Gilles) [13:23:24] 06Operations, 06Performance-Team, 10Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2696367 (10Gilles) 05Open>03Resolved [13:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:26] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312735 (https://phabricator.wikimedia.org/T146417) (owner: 10Dereckson) [13:24:00] (03Merged) 10jenkins-bot: Disable Upload Wizard blacklist issues on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312735 (https://phabricator.wikimedia.org/T146417) (owner: 10Dereckson) [13:24:23] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2696374 (10Gilles) That looks like it could do the job, I'll try it out and see if it can be easily backported. [13:25:07] MatmaRex: ping? [13:25:34] (03PS3) 10Gilles: Separate Thumbor 404s into their own log [puppet] - 10https://gerrit.wikimedia.org/r/313899 [13:26:32] hi [13:27:02] MatmaRex: I've pulled on mw1099 Disable Upload Wizard blacklist issues on Commons [13:27:13] any quick idea to test it? 
[13:27:59] (my idea was to touch a DSC0000.jpg file and try to upload it through the wizard) [13:28:18] try to upload a file with a name that is blacklisted, and see that you don't have the dialog to submit a blacklist false positive [13:28:33] you might need to use a new account, or a sockpuppet [13:29:03] since you'll bypass the blacklist if you have the 'tboverride' user right, or something. sysops have it, maybe some other groups too [13:29:04] (CR) Elukey: [C: 1] "The only rule used is:" [puppet] - https://gerrit.wikimedia.org/r/314338 (owner: Andrew Bogott) [13:30:12] ah yes I'm sysop on commons [13:30:29] Could you test that if you already have such a new/test account? [13:35:44] Operations, ops-eqiad, DBA: db1065: Degraded RAID - https://phabricator.wikimedia.org/T147396#2696401 (Cmjohnson) The Reference Dispatch Number is: 321760647 Your part dispatch will be delivered to the following location: Wikimedia c/o of Equinix, 21721 Filigree Ct. Cage 61130 Ashburn, VA 20147 [13:37:16] well, upload wizard still works, we can test that later [13:37:39] I'll leave an update on the task.
[13:38:37] (03PS2) 10Andrew Bogott: Add $use_ssl switch to role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/314441 [13:38:48] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Disable Upload Wizard blacklist issues on Commons (T146417) (duration: 00m 49s) [13:38:49] T146417: Set $wgUploadWizardConfig['blacklistIssuesPage'] = ''; for commonswiki - https://phabricator.wikimedia.org/T146417 [13:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:57] 06Operations, 10ops-eqiad, 10DBA: db1065: Degraded RAID - https://phabricator.wikimedia.org/T147396#2696423 (10Marostegui) Awesome - thanks for the heads up [13:39:31] (03PS3) 10Dereckson: Configure Visual Editor namespaces on sv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309808 (https://phabricator.wikimedia.org/T144688) [13:39:42] (03CR) 10jenkins-bot: [V: 04-1] Add $use_ssl switch to role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/314441 (owner: 10Andrew Bogott) [13:39:44] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309808 (https://phabricator.wikimedia.org/T144688) (owner: 10Dereckson) [13:40:12] (03Merged) 10jenkins-bot: Configure Visual Editor namespaces on sv.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309808 (https://phabricator.wikimedia.org/T144688) (owner: 10Dereckson) [13:40:21] (03CR) 10Andrew Bogott: "yeah, we can wait for Otto." [puppet] - 10https://gerrit.wikimedia.org/r/314338 (owner: 10Andrew Bogott) [13:40:33] 309808 live on mw1099 [13:41:31] (03PS3) 10Andrew Bogott: Add $use_ssl switch to role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/314441 [13:42:20] Works for Portal: [13:42:48] Doesn't work for Wikipedia: [13:43:27] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:46:40] I guess we should use canonical Project [13:49:22] Doesn't work either with Project [13:50:05] !log cache_text: rolling reboots for kernel upgrades [13:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:22] !log added mw122[67] back to the api appservers live pool [13:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:33] https://ht.wikipedia.org/wiki/Wikipedya:Foo?veaction=edit works well with project [13:52:12] okay I've it working on mw1099, I'm preparing a follow-up patch [13:54:02] !log citoid deploying 4d97774 [13:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:24] (03PS1) 10Dereckson: Fix namespace for sv.wikipedia Visual Editor configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314558 (https://phabricator.wikimedia.org/T144688) [13:54:47] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314558 (https://phabricator.wikimedia.org/T144688) (owner: 10Dereckson) [13:55:17] (03Merged) 10jenkins-bot: Fix namespace for sv.wikipedia Visual Editor configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314558 (https://phabricator.wikimedia.org/T144688) (owner: 10Dereckson) [13:56:52] fix live on mw1099 [13:57:05] Still working fine. [13:58:43] 06Operations, 10ops-esams, 10DNS, 10Traffic, 10netops: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2696596 (10faidon) 05Open>03Resolved a:03faidon Two weeks have passed and this hasn't reoccurred. I'm going to resolve this for now — we can reopen if it happens again or if we h... 
[13:58:59] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Configure Visual Editor namespaces on sv.wikipedia ([[gerrit:309808]] and [[gerrit:314558]], T144688) (duration: 00m 50s) [13:59:00] T144688: Enable Visual Editor in more namespaces on svwiki - https://phabricator.wikimedia.org/T144688 [13:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:43] Operations, Traffic: Consider per-route DCTCP for dc-local traffic on jessie hosts - https://phabricator.wikimedia.org/T128377#2696608 (faidon) [14:00:04] (PS1) Ema: upload storage: avoid cron restarts while rebooting [puppet] - https://gerrit.wikimedia.org/r/314560 (https://phabricator.wikimedia.org/T145661) [14:00:33] Operations, Labs, netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406#2696632 (faidon) We agreed on all of the above during the Barcelona offsite. We've preliminarily agreed to attempt implementing them in tandem with the Neutron migration, which wou... [14:03:31] hashar: will the train deploy wmf21 on group2 today or are there still blockers?
[14:09:05] (PS5) Dereckson: New 'engineer' group for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/308448 (https://phabricator.wikimedia.org/T144599) (owner: MarcoAurelio) [14:09:31] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/308448 (https://phabricator.wikimedia.org/T144599) (owner: MarcoAurelio) [14:09:41] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:10:08] (CR) BBlack: [C: 2] upload storage: avoid cron restarts while rebooting [puppet] - https://gerrit.wikimedia.org/r/314560 (https://phabricator.wikimedia.org/T145661) (owner: Ema) [14:11:04] (CR) Elukey: [C: 1] vk::webrequest - adjust peak rate estimates [puppet] - https://gerrit.wikimedia.org/r/314336 (owner: BBlack) [14:11:09] SWAT status: all done, except the engineer group for ruwiki, we're waiting for zuul (2 Jenkins tasks remaining) to go on. [14:12:10] (PS1) Hashar: contint: add phpdbg for code coverage [puppet] - https://gerrit.wikimedia.org/r/314563 [14:14:20] Operations, Traffic: Consider per-route DCTCP for dc-local traffic on jessie hosts - https://phabricator.wikimedia.org/T128377#2696723 (BBlack) Open>declined Per-route congestion control is complicated, and DCTCP requires ECN support from our network gear, and may not play nice with other concurr... [14:14:45] Okay, WikimediaMessages wmf cherry-picks are merged. [14:17:03] Operations, Traffic, User-Joe, discovery-system: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480#2693856 (Joe) a:Joe [14:17:50] (CR) Elukey: "https://puppet-compiler.wmflabs.org/4227/ show changes for aqs100[456] only in Ferm rules (as expected)." [puppet] - https://gerrit.wikimedia.org/r/314542 (https://phabricator.wikimedia.org/T147461) (owner: Elukey) [14:18:20] (CR) Dereckson: [C: 2] "SWAT, take two."
[mediawiki-config] - https://gerrit.wikimedia.org/r/308448 (https://phabricator.wikimedia.org/T144599) (owner: MarcoAurelio) [14:18:43] (PS2) Jcrespo: Add ecwikimedia to the list of private wikis [puppet] - https://gerrit.wikimedia.org/r/314465 (https://phabricator.wikimedia.org/T135521) (owner: Dereckson) [14:18:47] (Merged) jenkins-bot: New 'engineer' group for ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/308448 (https://phabricator.wikimedia.org/T144599) (owner: MarcoAurelio) [14:19:51] l10n + 308448 live on mw1099 [14:20:46] (CR) Jcrespo: [C: 2] Add ecwikimedia to the list of private wikis [puppet] - https://gerrit.wikimedia.org/r/314465 (https://phabricator.wikimedia.org/T135521) (owner: Dereckson) [14:20:50] Works fine. [14:21:17] (config part, l10n will need a cache rebuild, but that can wait for this evening's l10nupdate run) [14:21:59] (or I can run the cache rebuild afterwards) [14:23:31] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: New 'engineer' group for ruwiki (T144599) (duration: 00m 52s) [14:23:32] T144599: New "engineer" usergroup for ruwiki - https://phabricator.wikimedia.org/T144599 [14:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:00] !log dereckson@tin Synchronized php-1.28.0-wmf.20/extensions/WikimediaMessages/i18n/wikimedia: Wikimedia messages for new 'engineer' group for ruwiki (T144599) (duration: 00m 49s) [14:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:00] !log dereckson@tin Synchronized php-1.28.0-wmf.21/extensions/WikimediaMessages/i18n/wikimedia: Wikimedia messages for new 'engineer' group for ruwiki (T144599) (duration: 00m 49s) [14:26:00] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 677 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3057420 keys - replication_delay is 677 [14:26:06] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:28:20] SWAT done. [14:28:39] Dereckson: merci :) [14:29:23] (03PS1) 10Muehlenhoff: Remove ca.patch, now obsolete [debs/openssl] - 10https://gerrit.wikimedia.org/r/314566 [14:30:54] Dereckson, hi, it seems we are still on wmf20 in group1 - do you know if its due to https://phabricator.wikimedia.org/T145220, or it simply didn't go through? [14:31:28] yurik: yes, those tracking tasks indeed note the blockers [14:32:05] do we know if the train is likely to finish this week? [14:33:01] That's a question I've too. [14:33:18] !log cache_upload: rolling reboots for kernel upgrades [14:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:49] (03PS1) 10Muehlenhoff: Update cloudflare patch for 1.0.2i [debs/openssl] - 10https://gerrit.wikimedia.org/r/314567 [14:40:21] <_joe_> !log uploaded conftool 0.3.1 to apt.w.o, T147480 [14:40:22] T147480: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480 [14:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:29] !log restarting db1069:3133 mysql instance [14:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:26] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, should make also inventory/audit easier" [puppet] - 10https://gerrit.wikimedia.org/r/314450 (https://phabricator.wikimedia.org/T84518) (owner: 10Dzahn) [14:47:17] (03PS2) 10Dzahn: noc: Also use HHVM on jessie [puppet] - 10https://gerrit.wikimedia.org/r/314514 (owner: 10Muehlenhoff) [14:47:48] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2696798 (10fgiunchedi) Announced on ops@ list -- Monday 10th we'll be reinstalling with SSDs [14:48:25] 06Operations, 10ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2696799 (10fgiunchedi) 05Open>03declined See latest update on T130759, SSD 
upgrade scheduled to happen on Monday 10th [14:48:54] (03CR) 10Dzahn: [C: 032] noc: Also use HHVM on jessie [puppet] - 10https://gerrit.wikimedia.org/r/314514 (owner: 10Muehlenhoff) [14:49:45] (03CR) 10Dzahn: "yep, only used on mw1152 which is trusty as of today. but this will unblock upgrading" [puppet] - 10https://gerrit.wikimedia.org/r/314514 (owner: 10Muehlenhoff) [14:53:31] (03CR) 10Hashar: [C: 031] "Looks fine to me. The versions are indeed the same on both permanent and nodepool slaves." [puppet] - 10https://gerrit.wikimedia.org/r/314516 (owner: 10Muehlenhoff) [14:54:21] (03PS2) 10Dzahn: Debian moved back to firefox, stop using iceweasel [puppet] - 10https://gerrit.wikimedia.org/r/314516 (owner: 10Muehlenhoff) [14:54:31] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2696809 (10BBlack) Side note on side note: we had some Varnish/VCL conditional code to treat hhvm and Ze... [14:55:51] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2696811 (10BBlack) On your repro attempts: I think the original case that was badly cached was for users... [14:56:19] (03CR) 10Dzahn: [C: 032] "yep, merging this on a former Iceweasel that became Firefox again" [puppet] - 10https://gerrit.wikimedia.org/r/314516 (owner: 10Muehlenhoff) [14:56:55] irccloud again? 
[14:59:56] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3034_v4, cp3034_v6 [14:59:57] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3034_v4, cp3034_v6 [15:00:52] looking ^ [15:01:06] PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:09] ema: me too, it's only going to be 3034 [15:01:13] (03PS8) 10Dereckson: Initial configuration for olo.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio) [15:01:25] 06Operations, 10Traffic: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#2696846 (10BBlack) [15:01:32] cant ssh normally, i'll let you have mgmt console then [15:01:58] mutante: it hanged while rebooting, I'm gonna power-cycle it [15:02:06] ack, alright [15:02:35] gotta drive a car, bbiaw [15:02:39] it's kind of funny how the decentralized IRC network of ancient design ends up having a central, single point of failure, which is a "cloud" service :P [15:02:58] (03CR) 10Dereckson: "PS8: as the wiki is in commonsupload.dblist, removed extraneous settings, either contradictory or redudant" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio) [15:03:17] !log cp3034 hanging during boot, power-cycled [15:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:52] RECOVERY - Host cp3034 is UP: PING WARNING - Packet loss = 86%, RTA = 84.15 ms [15:05:01] !log dereckson@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 15m 59s) [15:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:16] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:05:16] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, 
cp2011_v6 [15:05:17] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:05:28] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:05:36] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp1063_v4, cp1063_v6 [15:05:37] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp1063_v4, cp1063_v6 [15:05:47] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:05:47] Dereckson: I can't see the engineer messages on translatewiki yet, are they being imported in batches or... [15:05:53] mafk: er yes it is: https://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F:%D0%9F%D1%80%D0%B0%D0%B2%D0%B0_%D0%B3%D1%80%D1%83%D0%BF%D0%BF_%D1%83%D1%87%D0%B0%D1%81%D1%82%D0%BD%D0%B8%D0%BA%D0%BE%D0%B2 [15:05:57] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp2011_v4, cp2011_v6 [15:05:57] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp2011_v4, cp2011_v6 [15:05:57] ?uselang=en [15:05:57] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:06:02] mafk: since the 15:05:01 < logmsgbot> !log dereckson@tin scap sync-l10n completed (1.28.0-wmf.20) (duration: 15m 59s) [15:06:22] mafk: oh translatewiki, sorry [15:06:32] PROBLEM - Host cp2011 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:35] I meant it reached ru.wikipedia at 15:05:01 [15:06:42] Dereckson: yep, but if it's on the wiki then it's not breaking nothing [15:06:48] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp2011_v4, cp2011_v6 [15:06:53] we did well in cherry-picking those :) [15:07:09] cp2011 also froze while booting [15:07:10] (03CR) 10Addshore: "an order of magnitude faster." 
[puppet] - 10https://gerrit.wikimedia.org/r/314563 (owner: 10Hashar) [15:07:33] Yes, I've documented that on https://wikitech.wikimedia.org/wiki/LocalisationUpdate#Running_LU_manually in May. [15:07:55] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare storage layer for olo.wikipedia - https://phabricator.wikimedia.org/T147302#2696868 (10jcrespo) a:03jcrespo Claimed, but will be done together with @Marostegui for demonstration purposes. [15:07:59] New WikimediaMessages key must be cherry-picked before l10nupdate running. If not, it won't add new messages, only update existing ones. [15:09:25] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2696878 (10jcrespo) [15:10:59] and cherry-pick should happen once the master change is merged? (for the correct commit message I mean) [15:11:36] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 142 not-conn: cp2011_v4, cp2011_v6, cp2016_v4, cp2016_v6, cp4015_v4, cp4015_v6 [15:11:50] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 142 not-conn: cp2011_v4, cp2011_v6, cp2016_v4, cp2016_v6, cp4015_v4, cp4015_v6 [15:11:57] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2011_v4, cp2011_v6, cp4015_v4, cp4015_v6 [15:12:08] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 142 not-conn: cp2011_v4, cp2011_v6, cp2016_v4, cp2016_v6, cp4015_v4, cp4015_v6 [15:12:37] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4015_v4, cp4015_v6 [15:12:37] RECOVERY - Host cp2011 is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [15:12:38] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4015_v4, cp4015_v6 [15:12:47] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4015_v4, cp4015_v6 [15:12:50] mafk: yes [15:12:56] PROBLEM - IPsec on cp2008 is CRITICAL: 
Strongswan CRITICAL - ok: 68 not-conn: cp4015_v4, cp4015_v6 [15:13:17] ignore the IPsec alerts above, those machines are just being restarted [15:13:28] Okay! [15:13:33] mafk: we don't generally deploy code not in master [15:13:40] I've seen one exception for debug purpose [15:14:03] yep, I know, only when there's need to have it /now/ [15:14:13] and another exception is the security patches [15:14:16] I'll continue committing to master [15:14:16] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [15:14:27] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [15:14:36] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [15:14:46] yes, the only time you need to cherry-pick is when you require a backport and are adding it to SWAT calendar [15:14:47] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [15:14:47] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [15:15:17] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [15:15:18] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [15:15:27] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [15:22:21] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:30] (03CR) 10Hashar: "But does not work on our CI for some reason :(" [puppet] - 10https://gerrit.wikimedia.org/r/314563 (owner: 10Hashar) [15:25:48] !log powercycle cp3045 [15:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:47] RECOVERY - Host cp3045 is UP: PING OK - Packet loss = 0%, RTA = 84.24 ms [15:28:48] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [15:29:07] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [15:29:21] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [15:29:23] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [15:29:23] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [15:29:37] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 
ESP OK [15:29:38] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [15:29:38] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [15:29:47] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [15:29:58] PROBLEM - Varnish HTTP text-backend - port 3128 on cp3041 is CRITICAL: Connection refused [15:33:58] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:35:12] 06Operations, 10Pybal, 06Services, 13Patch-For-Review, 15User-mobrovac: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled - https://phabricator.wikimedia.org/T145518#2696917 (10Joe) When T147480 will be reolved, this ticket will be partly solved; at least pool/depo... [15:35:29] RECOVERY - Varnish HTTP text-backend - port 3128 on cp3041 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.171 second response time [15:39:42] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [15:41:21] <_joe_> !log upgrading conftool to 0.3.1 on all mw*, wtp* servers, T147480 T145518 [15:41:23] T147480: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480 [15:41:23] T145518: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled - https://phabricator.wikimedia.org/T145518 [15:41:26] !log dereckson@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 15m 46s) [15:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:48] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:42:03] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [15:46:55] (03CR) 10Alexandros Kosiaris: [C: 032] base: activate vlan reporting via LLDP [puppet] - 10https://gerrit.wikimedia.org/r/314450 
(https://phabricator.wikimedia.org/T84518) (owner: 10Dzahn) [15:47:00] (03PS2) 10Alexandros Kosiaris: base: activate vlan reporting via LLDP [puppet] - 10https://gerrit.wikimedia.org/r/314450 (https://phabricator.wikimedia.org/T84518) (owner: 10Dzahn) [15:47:18] 06Operations, 10MediaWiki-extensions-VipsScaler, 10Wikimedia-Site-requests, 13Patch-For-Review: VIPS scaled thumbnails don't have a comment with a link to the file description page - https://phabricator.wikimedia.org/T71336#2696957 (10Dereckson) So the question is to know if we want to do this with exiv2,... [15:47:30] (03CR) 10Alexandros Kosiaris: [V: 032] base: activate vlan reporting via LLDP [puppet] - 10https://gerrit.wikimedia.org/r/314450 (https://phabricator.wikimedia.org/T84518) (owner: 10Dzahn) [15:48:38] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 6 15:48:38 UTC 2016 (duration 7m 12s) [15:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:58] 06Operations, 10Traffic, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2696978 (10BBlack) 05Resolved>03Open Of course, I spoke too soon. The stats anomaly is slowly becoming visible again, and live logging confirms these broken clien... [15:56:19] (03PS1) 10BBlack: Text VCL: workaround fake/buggy Chrome/41 again [puppet] - 10https://gerrit.wikimedia.org/r/314573 (https://phabricator.wikimedia.org/T141786) [15:57:06] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: workaround fake/buggy Chrome/41 again [puppet] - 10https://gerrit.wikimedia.org/r/314573 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [15:58:19] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T1600). Please do the needful. 
[16:00:04] mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:56] mobrovac: ^ taking a look [16:02:09] (03PS6) 10Filippo Giunchedi: Extend classpath via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/313619 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [16:03:33] (03CR) 10Filippo Giunchedi: [C: 032] Extend classpath via Puppet [puppet] - 10https://gerrit.wikimedia.org/r/313619 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [16:06:29] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:08:09] (03PS1) 10Alexandros Kosiaris: servermon: Fix an error in report handler [puppet] - 10https://gerrit.wikimedia.org/r/314574 [16:09:57] (03PS6) 10Filippo Giunchedi: Enable cassandra/twcs deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/313892 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [16:10:38] mobrovac: will run puppet on tin/mira after ^ is merged to [16:10:39] too [16:10:43] (03CR) 10Alexandros Kosiaris: [C: 031] hiera: always search for the full key [puppet] - 10https://gerrit.wikimedia.org/r/312206 (https://phabricator.wikimedia.org/T147403) (owner: 10Giuseppe Lavagetto) [16:10:57] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Fix an error in report handler [puppet] - 10https://gerrit.wikimedia.org/r/314574 (owner: 10Alexandros Kosiaris) [16:11:01] (03PS2) 10Alexandros Kosiaris: servermon: Fix an error in report handler [puppet] - 10https://gerrit.wikimedia.org/r/314574 [16:11:03] (03CR) 10Alexandros Kosiaris: [V: 032] servermon: Fix an error in report handler [puppet] - 10https://gerrit.wikimedia.org/r/314574 (owner: 10Alexandros Kosiaris) [16:11:22] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/cassandra.in.sh] [16:12:09] godog: --^ [16:12:20] elukey: thanks, taking a look [16:12:37] meh, recovered on the next puppet run [16:13:20] (03CR) 10Filippo Giunchedi: [C: 032] Enable cassandra/twcs deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/313892 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [16:13:26] (03PS7) 10Filippo Giunchedi: Enable cassandra/twcs deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/313892 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [16:13:36] yep all good! [16:13:49] * elukey blames urandom [16:13:50] :P [16:13:58] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:14:13] (03CR) 10Filippo Giunchedi: [V: 032] Enable cassandra/twcs deploy repository [puppet] - 10https://gerrit.wikimedia.org/r/313892 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [16:14:23] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:14:54] godog: we'll need to run puppet on rb nodes too [16:19:18] mobrovac: yup, doing that now, so far so good [16:23:03] mobrovac: LGTM, puppet still finishing [16:25:14] \o/ [16:29:00] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [16:29:22] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[16:30:18] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:04] !log power-cycling cp2022 [16:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:59] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:32:50] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:33:29] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms [16:39:59] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:40:00] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3046_v4, cp3046_v6 [16:40:09] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3046_v4, cp3046_v6 [16:40:09] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3046_v4, cp3046_v6 [16:40:29] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3046_v4, cp3046_v6 [16:40:39] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3046_v4, cp3046_v6 [16:40:58] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:42:09] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3046_v4, cp3046_v6 [16:42:42] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [16:42:42] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [16:42:43] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [16:42:58] godog: ms-be1026 ^ ? 
[16:43:02] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [16:43:19] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [16:44:11] swift-container-replicator is quite active [16:44:39] box has a load of 42.. for a box with 40 cores it isn't much [16:45:16] akosiaris: yeah there's ms-be1022 being put in service so that might be the rebalancing, but I think I'll send out a patch to increase the timeout further and/or check less often [16:45:22] most of those are spurious anyway [16:45:32] spurious meaning timeouts [16:46:10] hmmm it does indeed take a very long time to return [16:46:14] /usr/local/lib/nagios/plugins/check_hpssacli [16:46:22] that is [16:46:37] I suppose with the box being under heavy IO [16:46:48] the controller is busy doing other things than reporting to management checks [16:47:26] yeah, and since we expose every disk to linux it is number of disks 2x (ld and pd) [16:47:35] the number of things to check that is [16:47:46] real 0m44.362s [16:47:48] damn [16:47:50] for 4 secs :P [16:48:11] heheh and I've bumped it already once heh [16:50:10] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2005_v4, cp2005_v6, cp3044_v4, cp3044_v6 [16:51:28] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:54:24] !log uploaded to apt.wikimedia.org precise-wikimedia/main: php5_5.3.10-1ubuntu3.25+wmf1 [16:54:25] moritzm: ^ [16:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:59] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[16:57:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:58:09] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [16:58:10] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T1700). [17:00:20] no parsoid deploy [17:00:29] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1099_v4, cp1099_v6 [17:00:39] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1099_v4, cp1099_v6 [17:00:39] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1099_v4, cp1099_v6 [17:00:39] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1099_v4, cp1099_v6 [17:00:59] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp1099_v4, cp1099_v6 [17:03:09] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [17:03:28] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [17:03:29] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [17:03:48] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [17:04:12] no ORES either [17:04:21] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2697180 (10GWicke) In case the AMS procurement hasn't happened yet, it might make sense to also consider the old AQS nodes (see T147460) in eqiad for use as... 
[17:04:29] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:10:59] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [17:11:26] !log power-cycling cp2017 [17:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:18] RECOVERY - Host cp2017 is UP: PING WARNING - Packet loss = 54%, RTA = 36.19 ms [17:17:31] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3035_v4, cp3035_v6 [17:17:34] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp3035_v4, cp3035_v6 [17:17:34] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3035_v4, cp3035_v6 [17:19:58] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [17:19:59] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [17:20:00] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [17:20:09] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [17:33:50] (03PS1) 10Rush: bdsync debian directory [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 [17:37:03] (03PS2) 10Andrew Bogott: fix labstore cluster: labsnfs [puppet] - 10https://gerrit.wikimedia.org/r/309689 (owner: 10Alex Monk) [17:41:46] (03CR) 10Andrew Bogott: [C: 032] fix labstore cluster: labsnfs [puppet] - 10https://gerrit.wikimedia.org/r/309689 (owner: 10Alex Monk) [17:41:49] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:43:28] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:44:44] (03PS2) 10Andrew Bogott: couple more labs support host hiera cluster key cleanups [puppet] - 10https://gerrit.wikimedia.org/r/309690 (owner: 
10Alex Monk) [17:49:29] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:49:49] (03CR) 1020after4: [C: 031] Create a phabricator_stopwords phabricator table in sql (innodb) [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [17:51:26] Ummm, gerrit where'd you go? [17:51:33] Timeout...? [17:52:08] Pings ok, as is ssh. [17:52:26] ostriches: dropped out on me too [17:52:40] Ok so not just me. I'm looking at it now [17:53:28] 06Operations, 10Traffic: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2697291 (10BBlack) 05Open>03declined This seems really complicated to get "right", and it's only in corner cases that it even helps us much. There's potential downsides on the pattern-ada... [17:53:32] Ok it's gerrit not apache. [17:53:34] Slow but responsive for me [17:54:02] ostriches: it eventually loaded but a reload is takign the same long ass time fyi [17:54:14] Weird..... [17:54:23] I see nothing (unusual) in the gerrit error logs. [17:57:09] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T1800). Please do the needful. [18:00:04] SMalyshev: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:08] Um, CPU is thrashing though [18:00:09] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=lead.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1475776734&g=cpu_report&z=large&c=Miscellaneous%20eqiad [18:01:10] I can SWAT today (if the stars align with gerrit, et al). [18:01:40] SMalyshev: ping for SWAT. [18:03:36] thcipriani: here! [18:03:42] ok! [18:04:09] (03PS7) 10Thcipriani: Add config for units on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: 10Smalyshev) [18:04:22] heyo, gerrit thought about that for a while [18:04:33] !log gerrit: kicking gerrit and apache, something is unhappy... [18:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:43] ostriches: is it fine to be SWATting now? Or do you wanna...ok, pausing :) [18:05:52] hmm... gerrit is down? [18:06:35] yes [18:06:45] yarp, seems like it was having a bad time. [18:07:01] i just noticed it when using git review, the ssh on high port isnt running [18:07:13] the web ui was still there for me [18:07:16] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:07:17] Ummmm. [18:07:21] Why ain't you starting? [18:07:28] * ostriches kicks gerrit harder [18:07:34] wtf gives? [18:08:24] Is there any errors in the log? [18:08:59] No. [18:09:04] Oh [18:09:04] just that it enters a failed state [18:09:25] Oh wait it does take a while after you do bin/gerrit.sh start for gerrit to fill the log [18:09:27] PROBLEM - SSH access on lead is CRITICAL: Connection refused [18:09:32] gerrit2 30659 152 0.8 32475860 264160 ? Sl 18:06 4:31 GerritCodeReview -Xmx28g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site --run-id=1475777179.30635 [18:09:36] so it is running [18:09:42] Claims it is. 
[18:10:01] 06Operations, 10Traffic, 07Beta-Cluster-reproducible, 13Patch-For-Review: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2697362 (10ori) If I understood [[ https://github.com/facebook/hhvm/blob/235b6ed60f54fe4d1f18bc9592e4a7ea5f573b05/hph... [18:10:23] Unit gerrit.service entered failed state. [18:10:29] Yeah, which I don't understand. [18:10:38] [2016-10-06 18:10:31,127] [main] INFO org.eclipse.jetty.server.Server : Started @250013ms [18:10:38] [2016-10-06 18:10:31,161] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.12.3 ready [18:10:40] lol [18:10:41] It wasn't even logging while trying to start. [18:10:45] expected more in journalctl but meh [18:11:01] seems to be working now [18:11:08] 250secs for start up ? [18:11:17] [2016-10-06 18:09:20,055] [main] INFO org.eclipse.jetty.util.log : Logging initialized @178909ms [18:11:20] That seems bad ^ [18:11:25] only java processes are able of that [18:11:28] Website is back [18:11:34] It shouldn't take nearly that long. [18:11:35] It never has. [18:11:43] It has for me [18:11:54] on test instances it takes a while [18:11:56] a little bit delay seemed normal but not that much [18:11:57] akosiaris: thanks, will take care of the updates tomorrow [18:12:06] moritzm: ok thanks [18:12:10] RECOVERY - SSH access on lead is OK: SSH OK - GerritCodeReview_2.12.3 (SSHD-CORE-0.14.0) (protocol 2.0) [18:12:19] No, for prod that's unacceptable. [18:12:24] Nothing is loading on gerrit though [18:12:24] It's never taken that long. [18:12:31] Just showing working [18:12:43] Caches are all cold, things gonna be slow. 
[18:12:48] Oh [18:13:02] This still seems off though [18:13:15] Oh [18:13:42] This still isn't right, something's up [18:13:53] 5000 ms timeout reached for IntraLineDiff in project mediawiki/core on commit bf2a4ef40f18e346c64c5db558e04586a2a3a5f8 for path languages/i18n/zh-hans.json comparing 39a573a46ead49140af117f1f0ba298c02b3043d..4cb1ff44b003f8c3abc1d83305aa596b31673f0c [18:14:01] ok.. something must be wrong [18:14:05] how is the db doing [18:14:22] Yeah things gonna be timing out. [18:14:28] akosiaris: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=lead.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Miscellaneous+eqiad [18:14:35] You can see where I killed gerrit, but cpu usage shot right back up [18:14:46] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:15:02] ah yes I can see it in htop as well [18:15:36] That log file thing is bothering me from startup. Some of those logs are kinda big from this rotation. I wonder if I should force-rotate them and see... [18:16:27] what changed on 17:49 ? [18:16:43] I dunno, I wasn't doing any work on lead. [18:16:52] Just noticed the UI was getting laggy/timeouty [18:17:39] hmmm the raid is resyncing [18:17:45] well tries to resync [18:17:53] md1 : active (auto-read-only) raid1 sda2[0] sdb2[1] [18:17:54] 976320 blocks super 1.2 [2/2] [UU] [18:17:54] resync=PENDING [18:18:33] which is swap... 
[18:18:43] the rest though look ok [18:18:52] and there isn't really any sign yet of a broken disk [18:18:55] sees some exceptions in /var/lib/gerrit2/review_site/logs/error_log but that doenst seem to match the timing of 17:49 [18:19:12] Yeah most of that is background noise [18:19:23] and the system is not in so much io-wait anyway [18:19:35] it's mostly user CPU usage [18:19:36] akosiaris: I recall faidon mentioning one time that swap will always look like that until accessed [18:19:54] chasemp: hmm [18:21:17] akosiaris: I'm going to stop puppet on lead for a bit so it doesn't come along and try to be smart and kick processes or whatever. [18:21:27] ok [18:21:34] Plus I can force the downtime message so it won't give people false hopes [18:21:49] !log lead: disabled puppet for now, gerrit's sick [18:21:50] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:19] ostriches: is puppet agent --disable stil running indeed ? [18:22:24] Yes. [18:22:29] Er, just finished [18:22:30] 35 seconds ? [18:22:42] it's touch a single file.. wtf ? [18:23:10] wow [18:23:27] it is ruby, don't forget ;) [18:24:10] !log lead: restarting apache to force error page to show for now [18:24:13] 06Operations, 10Traffic, 07Beta-Cluster-reproducible, 13Patch-For-Review: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2697376 (10BBlack) In general, zlib supports a defined compression level of `0`, which means "no compression", but is... [18:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:32] :( [18:24:38] I am getting this feeling it's a box ... [18:24:45] gerrrrrrrrrrrrrit [18:24:47] not sure what I base it on yet [18:25:05] Something hardware with lead you mean? 
[18:25:14] sigh, bot* [18:25:24] Ohhhh [18:25:28] look at how CPU usage went down the moment you restarted apache [18:25:36] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:26:07] also an strace showed weird patterns [18:26:23] [pid 31580] stat("/srv/gerrit/git/operations/mediawiki-config.git/refs/changes/51/30551", [18:26:28] immediately followed by [18:26:40] [pid 31575] openat(AT_FDCWD, "/srv/gerrit/git/mediawiki/core.git/refs/changes/05/51805", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC [18:26:49] and [18:26:51] [pid 31575] close(324 [18:26:51] [pid 31580] stat("/srv/gerrit/git/operations/mediawiki-config.git/refs/changes/51/170351", [18:26:55] 06Operations, 10Traffic, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2697392 (10Nemo_bis) >>! In T141786#2696978, @BBlack wrote: > At this point I think it's more likely a fake UA string from some kind of benchmarking or other software... [18:27:16] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:27:34] didnt we block google bot from indexing one time [18:27:40] i see baidu spider in access log [18:28:02] hmm, slow though, nevermind [18:30:28] ARRGH [18:30:36] Gerrit redirects to /error.html when it's down [18:30:44] I'm gonna stop the phab mirroring task just in case it's related to that... [18:30:49] So now any Gerrit tab that I open, I lose track of what change I was trying to view [18:31:17] Dear gerrit, you suck. [18:31:22] So, /so/ much. [18:31:25] akosiaris / ostriches: did that help? 
[18:31:40] !log stopped phd on iridium to relieve some load on gerrit [18:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:47] That's not gerrit's fault. [18:31:51] James_F: {{SoFixIt}} [18:31:53] I helped. [18:31:55] That's my fault so people stop getting false hopes that it's working [18:32:08] Well, it /was/ working [18:32:10] twentyafterfour: slightly [18:32:53] I was thinking last night's update might have caused phabricator to mirror more aggressively [18:33:13] (there were changes related to git mirroring scheduler) [18:33:23] well phab is probably that one CPU pegged at 100% [18:33:24] I'm curious why it would suddenly have exploded this morning then [18:33:27] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:33:32] evening :P [18:33:35] Same thing [18:33:40] * akosiaris joking [18:33:48] (as opposed to when the update happened last night) [18:34:05] I'm not sure, it was just a stab in the dark [18:34:08] look at the ganglia graph [18:34:14] how it went down since the phab stop [18:34:27] no that's apache being restarted [18:34:27] the mirroring change sounds likely i guess [18:34:30] oh [18:34:43] That was apache being restarted and forcing the error page. [18:34:50] So no web traffic is hitting gerrit now [18:34:55] ok [18:35:26] now it's mostly in futexes according to strace [18:37:05] Can we do a snapshot of /srv/gerrit/git and copy it off lead? I'm getting paranoid. [18:37:29] sure [18:37:36] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:37:37] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:37:55] lemme see how big it is [18:38:06] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 10 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki] [18:38:06] About 20G, I think [18:38:08] Give or take [18:39:19] sigh I think it's going to take forever... [18:39:24] too many small files [18:39:33] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [18:39:35] PROBLEM - puppet last run on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:39:57] akosiaris: Don't worry about it, it's probably fine. [18:40:05] We have bacula backups, whenever that ran last :) [18:40:25] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [18:40:41] and ofc all git pulls fail now [18:40:48] Yeah.... [18:41:05] Ok, so I'm gonna try truncating those logs and restarting gerrit again. That startup time for log processing has me worried. [18:41:11] And could kinda explain things.... [18:41:16] ok [18:41:19] last bacula backup from 2016-10-06 02:05:18 [18:42:17] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:42:25] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) [18:42:53] Ok this is taking too long again, even with no logs. 
[18:43:39] there are 2 now running [18:43:48] pids 30659 34017 [18:43:52] That's bizarre. [18:43:55] I did a stop, then a start. [18:43:58] Why did 2 start? [18:44:09] 30659 is the old one [18:44:19] I was running strace on it like 5-10 mins ago [18:44:19] twentyafterfour: if you are looking at phd I think it crashed because of gerrit unavail [18:44:22] So it didn't stop, but says it did. [18:44:23] Great. [18:44:31] mirroring some many repos it couldn't handle the torrent of errors [18:44:43] ACKNOWLEDGEMENT - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) daniel_zahn twentyafterfour !log stopped phd on iridium to relieve some load on gerrit [18:44:44] no, 20after4 stopped it [18:44:48] ah [18:45:18] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:45:18] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [18:46:13] Ok, all java procs killed. [18:46:49] We'll let that rsync finish and then have a fourth look at starting. [18:46:50] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:49:32] ostriches: start it, I 've killed that rsync... made no sense anyway [18:49:46] we got a backup from this day anyway [18:50:04] I just did the start, it's being slow still [18:50:26] and phabricator was mirroring everything as well so there shouldn't be much data loss [18:50:37] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:50:37] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:50:46] Job for gerrit.service failed. See 'systemctl status gerrit.service' and 'journalctl -xn' for details. [18:50:48] Again!? [18:51:19] It started though [18:51:19] PROBLEM - SSH access on lead is CRITICAL: Connection refused [18:51:22] I don't understand. [18:51:48] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:52:01] [2016-10-06 18:51:51,684] [main] INFO org.eclipse.jetty.util.log : Logging initialized @176824ms [18:52:05] Again with the slow logging start. [18:52:57] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 4 processes with UID = 997 (phd) [18:52:58] yeah, and without any load [18:53:06] 244 secs this time around [18:53:20] database perhaps [18:53:23] looking at that [18:53:43] It's on m2 [18:54:02] RECOVERY - SSH access on lead is OK: SSH OK - GerritCodeReview_2.12.3 (SSHD-CORE-0.14.0) (protocol 2.0) [18:54:03] db1020 is doing nothing... [18:54:24] i see a guy on stackoverflow with "delayed jetty start" etc, he added DEBUG logging [18:54:52] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 7 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/endowment] [18:55:36] another one talks about the github plugin [18:55:36] I doubt db1020 is to blame [18:55:41] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] [18:55:58] mutante: We don't have a github plugin [18:56:41] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 5 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_refinery_source],Exec[git_pull_analytics/discovery-stats],Exec[git_pull_aggregator_code],Exec[git_pull_analytics/reportupdater] [18:56:50] It's not the gc runner on the repos, that runs saturdays. [18:56:52] Hmmm [18:57:55] I am ruling out db1020.. it's not the DB at fault here [18:58:26] ostriches: gotcha, i just saw the replication to github, but also replicateOnStartup is false [18:59:41] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T1900). Please do the needful. [19:00:07] could it be hitting any of the limits, like packedGitOpenFiles = 4096 etc [19:00:13] nop [19:02:20] mutante: I don't think so... [19:03:13] ostriches: can we run jetty with -DDEBUG ? [19:03:22] hmmm other stuff is also slow [19:03:27] mutante: We could manually start it with that yeah [19:03:48] for example apt-get ... on a box that does practically nothing right now [19:04:10] We could stop gerrit entirely for a bit, see if something else is at cause. We might be chasing the wrong rabbit. [19:04:36] * akosiaris running perf top [19:04:43] no /topic change about gerrit? :) [19:05:09] nothing in perf top [19:05:53] perf top output is quite ok.. not pointing at anything [19:06:05] what's up? [19:06:18] well, not gerrit [19:06:26] * ostriches whacks Krenair with a bat [19:06:32] lol [19:06:39] :D [19:07:01] ori: Lead flipped out a little over an hour and a half ago. 
CPU is pegged, can't entirely figure out why. [19:07:41] can I log in and take a look as well, or would that be disruptive? [19:07:48] It's probably gerrit's fault, but nothing is showing up as to why that's the case [19:08:29] Having a look can't hurt. It's responding on SSH just fine, network is ok :) [19:08:47] ori: feel free [19:08:55] Ok, java finally calmed down... [19:09:03] Only using 4.5% now [19:09:13] Apache still busy af. [19:09:32] for the rate of requests being handled, apache is way too busy [19:09:58] AH00161: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting [19:10:17] but that looks transient [19:10:30] and not anything that would cause that obviously [19:11:30] 12:04 < ostriches> We could stop gerrit entirely for a bit, see if something else is at cause. We might be chasing the wrong rabbit. [19:11:33] ^ sounds good [19:12:03] and you did, i'm just delayed [19:12:06] Well I'm starting to suspect apache tbh. Gerrit's basically doing nothing but apache flips out [19:12:22] jstack is useless... [19:13:09] Can't print deadlocks:Unable to deduce type of thread from address 0x00007f4658002000 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread) [19:13:12] thanks a lot ... [19:15:07] it's being crawled by Googlebot currently, but that is not out of the ordinary, judging by the access logs [19:15:12] PROBLEM - Host lithium is DOWN: PING CRITICAL - Packet loss = 100% [19:15:30] ori: plus google bot is just getting the error page [19:15:37] and yet still gerrit is painfully slow [19:17:03] most traffic from iridum/phab [19:17:41] jmap not very interesting either. [19:18:40] I can stop phd again and disable puppet so it won't restart [19:18:55] UA git/1.9.1 from phab hitting ..refs?service=git-upload-pack [19:19:04] should I stop it? 
[19:19:39] dunno, but at least we know there was a change with that last night you said, right [19:20:07] well it was supposed to improve the situation with retrying failed mirroring jobs, not make it worse [19:20:16] but it could be buggy [19:20:26] Did the bast3001 fingerprint change? [19:20:29] i just notice there is a lot of that when filtering out error.html [19:20:34] hoo: yes [19:20:42] Hmm, that could explain a lot of the apache traffic. [19:20:50] If Phab's retrying too aggressively [19:20:55] mutante: Any chance you could document it on https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast3001.wikimedia.org [19:20:59] Oh i thought upstream fixed phab [19:21:01] /var/log/apache2# tail -f gerrit.wikimedia.org.https.access.log | grep -v error.html [19:21:03] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:21:05] i was on that [19:21:08] i saw a patch related to it keep retrying [19:21:15] https.access.log as opposed to the other [19:21:28] mutante: Yeah I filter the https and http (the former just handles redirects) [19:21:45] paladox: that's what I was referring to - it supposedly improved the retry timeout but maybe not [19:21:50] https://secure.phabricator.com/rP5d1359d78f66c8fbc0f777691fc04c935a942689 [19:21:53] Yeh [19:22:13] I guess if that is causing gerrit's high usage, then that fix probaly actually caused the problem [19:23:09] twentyafterfour: What time did you upgrade Phab last night? I wonder if it corresponds to hitting a retry window on a bunch of repos... 
[19:23:19] https://secure.phabricator.com/D16575 [19:23:31] ostriches: just about 6:30 pacific [19:23:44] !log disabled phd and puppet on iridium [19:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:53] aside from a 60Mbps outbound traffic spike right before problems started, which correlates with a bot crawling/agressive mirroring [19:24:11] I can't really find something else wrong with the ox [19:24:13] Ah agressive mirroring would probaly be caused by phab [19:24:13] box* [19:24:27] > 00:15 twentyafterfour: phabricator update complete and service is restored [19:24:31] hoo: i updated the page [19:24:47] godog: ^ i ran "gen_fingerprints" and pasted that to wiki [19:24:54] once put that into base [19:25:06] funny thing is current CPU usage, with the service down is around 15%, before that incident, it was 2% [19:25:15] I'd be surprised if phabricator could generate 60 megabit spike - it only mirrors one repo at a time [19:25:30] the gerrit docs do recommend very high end specs for if you have alot of ssh traffic, ie through git clone ssh* [19:25:48] before == 17:46 UTC [19:26:21] is opening URLs ending in refs?service=git-upload-pack especially expensive? [19:26:23] the 60mbit traffic spike is at 17:17 btw [19:26:28] and is probably not really relevant [19:27:10] https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&tab=ch&vn=&hide-hf=false&hreg%5B%5D=iridium%7Clead [19:27:13] and the box has had these kind of network spikes in the past 6 hours as well, so probably unrelated [19:27:25] I don't see any correlation [19:27:40] I am doubting there is any [19:28:01] Network spikes are common [19:28:06] Some git repos are big [19:28:28] we can probably rule out phab mirroring as a culprit I think. It does put load on gerrit but I don't think it's misbehaving [19:28:54] CPU usage is still too high if phab's not hitting it still [19:29:01] agreed with both of you... question still remains.. 
what on earth [19:29:10] mutante: Thanks [19:29:15] Could this be a dns attack? [19:29:20] ostriches: yeah phd is stopped again [19:29:22] paladox: ? [19:29:34] paladox: you mean a dos attack? [19:29:35] dns attacks can take down sites [19:29:36] lol [19:29:38] yes [19:29:39] there are still all those hits from iridium though [19:29:39] sorry [19:29:40] no [19:29:46] oh [19:29:53] mutante: iridium is still hitting it now? [19:29:57] yea [19:29:59] as in no way... we would have seen the traffic hitting the ox [19:30:00] box* [19:30:06] and it would not manifest as CPU [19:30:18] oh [19:30:25] We could ban googlebot, but it should be already.... [19:30:58] twentyafterfour: now it stopped it looks [19:31:18] mutante: I guess phd didn't want to give up so easily. I had to run service phd stop again [19:31:37] Eh, I removed robots.txt at some point... [19:31:37] I would run iozone to see if the IO subsystem is misbehaving but it clearly is not.. the box is NOT in IOwait [19:31:41] the gerrit error page sends a HTTP 200 response wtf? [19:31:55] Vulpix: Sorry, bigger problems right now. [19:32:04] (usually it sends a 503) [19:32:16] (I forced all traffic to the error page for now) [19:32:29] well, a 503 would definitively slow down googlebot, 200 won't [19:32:43] Yeah. [19:32:51] the number of req/s is to low to be causing this [19:33:01] I can actually tail the apache access logs [19:33:14] as in I can read them... [19:33:22] yes, since phd stopped [19:33:29] and still something's not right [19:33:51] Well, not sure. [19:33:55] cpu usage is dropping on apache finally [19:34:04] I'm inclined to think it's phab... [19:34:07] wanna try starting the service one more time? [19:34:08] ostriches: wanna try restarting gerrit ? [19:34:14] "I guess phd didn't want to give up so easily. 
I had to run service phd stop again" [19:34:21] ^ There was a lag [19:35:29] Oh here's part of the fun, I don't think `service gerrit stop` actually works [19:35:32] That explains half my fun earlier [19:36:09] sends a HUP and then a KILL I see ? [19:36:11] Shutting down cleanly. [19:36:18] I'm using the other wrapper [19:36:36] we should write a systemd unit for that [19:36:53] lol systemd [19:37:04] akosiaris: Upstream provides an init script I've been using. [19:37:09] But yeah, we can redo it I guess [19:37:55] yeah at some point .. but anyway, we digress... I see it's starting again [19:37:57] let's see [19:38:00] Slowly [19:38:01] Yeah [19:38:55] Failed...fantastic [19:39:10] Looks, like something wrong [19:39:15] No, just lies. [19:39:15] but should start soon after [19:39:40] well I can see that process reading stuff [19:39:53] is it still jetty being slow? [19:40:00] Not the jetty bit. [19:40:01] the phabricator issue that got fixed _would_ increase update frequency but it shouldn't affect the maximum frequency at all [19:40:01] [2016-10-06 19:39:45,397] [main] INFO org.eclipse.jetty.util.log : Logging initialized @180419ms [19:40:06] Well, jetty log? [19:40:20] 2 seconds slower that last time [19:40:22] great! [19:40:53] Progress! [19:40:56] (in the wrong direction) [19:41:25] wanna try the manual jetty start with -DDEBUG? [19:41:31] yes please [19:41:46] btw, should it not be trying to connect to the db or something ? [19:41:55] I don't have any connections to db1020... [19:42:06] a here we are [19:42:15] got 1 [19:42:42] takes a while for that to show up so I assume it is required very late in the startup cycle [19:42:43] why not try bash -x bin/gerrit.sh start [19:43:24] you will get a trace on the init script to see if there is any problems, wont tell you much [19:43:29] but might give you something [19:43:42] why is the box in so much system CPU ... [19:43:42] I've already been looking at that. 
[19:43:45] And it says nothing [19:44:43] i don't dare to say it.. but .. reboot [19:45:07] have you tried turning it off and on again ? :P [19:45:08] +1 [19:45:18] obligatory IT crowd pun... [19:45:26] I mean heck, it couldn't hurt. [19:45:31] At this point [19:45:59] Did you check /var/log/apache2/gerrit_error.log ? (Sorry if you already done this but just wondering) [19:46:09] paladox: yes [19:46:12] ok [19:46:14] thanks [19:46:31] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:46:45] mutante, akosiaris: Consensus on a reboot? [19:46:54] ostriches: ok from me [19:46:54] uptime 188 days ;-) [19:47:05] IPMI-sel says nothing btw [19:47:06] yea [19:47:10] !log lead: rebooting, because what have we got to lose [19:47:11] temperatures are fine as well [19:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:47:43] is somebody on mgmt? [19:48:00] might see something if hardware issues [19:48:07] I 'll connect [19:49:48] Could this be related https://phabricator.wikimedia.org/T103990 [19:49:55] No [19:50:02] ok [19:50:30] hmmm [19:50:37] [16261189.997267] INFO: rcu_sched self-detected stall on CPU [19:50:41] still not off btw [19:50:46] I did shutdown -r [19:50:57] And I can't ping it anymore [19:51:09] [16261203.146950] reboot: Restarting system [19:51:14] ok I am seeing BIOS [19:51:32] PROBLEM - Host lead is DOWN: PING CRITICAL - Packet loss = 100% [19:52:16] kernel loaded [19:53:22] RECOVERY - Host lead is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms [19:53:27] ssh is up [19:53:40] gerrit is trying to start [19:53:45] Oh yay [19:53:51] bets on "mysteriously fixed but no indication what happened"... [19:53:57] presses thumbs [19:54:35] jetty still hasn't logged anything [19:54:47] Yeah slow logging start again [19:54:55] uhmpf.. 
[19:56:35] help [19:56:40] wrong window [19:57:04] It's starting now [19:57:04] so, raclog, SEL have nothing [19:57:15] @181095ms [19:57:17] even worse [19:58:52] @265444ms... ok really weird [19:59:49] something like $ java -jar bin/gerrit.war daemon -d . --show-stack-trace ? [19:59:57] I'm stumped. [20:00:09] mutante: There's no stack trace to show though [20:00:59] can we somehow skip the logging part? [20:02:12] I'd have to write a quick log4j.properties. [20:02:25] The log.* settings in gerrit don't allow for disabling all logging [20:02:58] log4j.rootLogger=OFF might do it [20:03:18] I am sorry, what is the theory ? that logging makes gerrit slow ? [20:03:28] it is barely logging anything anyway... [20:04:04] I'm saying the logging startup is taking too long [20:04:07] But yeah, we want logging [20:04:15] We should do mutante's -DDEBUG idea [20:04:20] yes [20:04:23] yes, we want logging, but since the logging startup is always super slow [20:04:33] just thought we'd want to see if it starts without that or not [20:04:40] We should do the debug thing first [20:04:43] ok [20:05:07] mutante: Can you do that? I'm going to follow along mobile for like 5-10m, I've got to run and grab a coffee & smoke or I'm gonna go nuts. [20:05:55] it's as if the CPUs have been rate limited to a 486 ... [20:06:14] at least when doing some things [20:06:26] but /proc/cpuinfo says 5000bogomips [20:09:36] ok, so i saw "ran jetty with -DDEBUG" but how does it get started on lead [20:11:32] running a sysbench.. I want to compare the numbers [20:11:38] If there is no logs or anything be wrritten too, does that mean the disk has gone in to protection mode ie read only mode [20:11:39] ? [20:11:54] no [20:11:58] ok [20:12:14] any effort to write to read-only fs would be displayed [20:12:39] ok [20:14:53] mutante: I am doubting this is gerrit anymore [20:15:10] i was reading jetty.pp and the init script [20:15:18] akosiaris: does it feel like hardware error ? 
[20:15:21] I am gonna reboot once more with an older kernel... not sure what I would accomplish [20:15:31] well a hardware error would have some indication [20:15:39] it's an expensive box for a reason [20:15:48] *nod* [20:16:04] I would be happy if it was a hardware issue [20:16:07] we would find it out [20:16:12] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:16:37] It could be a hardwhere issue, maybe not one that can be tracked, but then i am guessing. [20:16:40] yes to all of that. do the older kernel [20:16:49] ok rebooting with 3.16 [20:17:27] broken RAM can do the weirdest things,, but i saw nothing [20:17:34] !log rebooting lead once more [20:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:05] Do we track ram if it breaks? Per mutante [20:18:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [20:18:46] ram problems usually aren't this kind of consistent though. segfaults is what I'd expect with ram problems [20:18:50] PROBLEM - Host lead is DOWN: PING CRITICAL - Packet loss = 100% [20:19:00] not super slow performance [20:20:51] RECOVERY - Host lead is UP: PING OK - Packet loss = 0%, RTA = 5.84 ms [20:21:53] asks #gerrit for ideas [20:22:29] the older kernel makes no difference [20:23:00] at least, I witness the same things, let's wait for gerrit's startup numbers [20:24:48] is there IO contention? 
[20:24:48] https://etherpad.wikimedia.org/p/gerrit-outage-20161006 - started at least a rough draft so we don't forget details [20:24:48] no [20:24:48] Vulpix: Not that we've seen, no [20:24:48] org.eclipse.jetty.util.log : Logging initialized @180582ms [20:24:48] ok no diff [20:24:48] so if it is a kernel bug, it decided to manifest today and is present across the 3.16-4.4 versions [20:24:53] and somehow I don't feel it's that possible anymore [20:25:08] org.eclipse.jetty.server.Server : Started @260266ms [20:25:16] every time it's a bit more... [20:25:26] I'm pretty much outta ideas. [20:25:53] yeah, if no IO, cpu nor network issues, what could it be? [20:26:47] http://stackoverflow.com/questions/26927939/jetty-startup-delay-due-to-scanning [20:26:54] we might have to try putting the role on another box [20:27:44] I'm just now getting kinda caught up, it's a huge backlog in here [20:27:53] maybe timeout trying to connect to a DB or whatever [20:28:02] Vulpix: Pretty sure that's not it, DB seems fine [20:28:11] some random thoughts: I had hardware inclinations throughout the story above, but I think failing cpu/mem would've shown other randomness or kernel crash reports, the results are a little too consistent for that [20:28:48] yeah it seems like a jetty / gerrit / java issue to me [20:29:01] Vulpix: And just using mysql cli from lead works too with no lags. [20:29:16] 2) Can we strace -ff a startup of gerrit to see what happens in those initial way-too-many seconds on startup? surely there must be some significant pauses for some threads there waiting on i/o or on network access to a db server or who knows? [20:29:16] any recent jdk updates? or any due? [20:29:21] It wouldn't be the first time I see some service not starting as it should, because somebody changed some setting at runtime, the server rebooted and it picked the "old" value from the config file :) [20:29:38] Reedy: Not terribly recently, no. 
[20:29:47] bblack: it seems the connection to the DB server is brought up very late in the gate [20:29:48] the server had quite some uptime though [20:29:57] pretty much after 200+ secs [20:30:07] I did some heavy stracing [20:30:19] mostly seeing lseek, open, close and futex [20:30:27] it's reading through all the repos [20:30:42] is phab ruled out at this point? as in, can we just shut off all of phab, and restart gerrit, and gerrit still sucks? [20:30:46] yes [20:30:51] we 've stopped it [20:30:54] ok [20:30:54] bblack: phab is shut off [20:31:04] Reedy: dont think so, we dont have ensure=>latest and nothing in apt/history.log [20:31:41] it started at 17:49 [20:32:04] there is an apt-get upgrade on 2016-10-05 19:56:53 [20:32:14] has upgrade quite a few things [20:32:20] oh [20:32:22] yea, but at least no jdk or java in it [20:32:49] akosiaris: can you pastebin a list of upgraded packages? [20:33:04] ruby [20:33:05] I don't think I have access to the server [20:33:24] I can at least help by googling potential issues ;) [20:33:42] https://phabricator.wikimedia.org/P4174 [20:33:47] it does have libc in there [20:33:49] thanks! [20:33:52] but what the ... [20:33:55] ohh [20:33:56] Shit... [20:34:25] libsystemd :p [20:34:37] lol [20:35:28] libxml, apache2, libssl [20:35:34] hm [20:35:35] so [20:35:42] akosiaris@bast4001:~$ sysbench --test=cpu run [20:35:45] any of those could be relevant to this problem I think [20:35:48] total time: 11.3119s [20:35:54] akosiaris@lead:~$ sysbench --test=cpu run [20:36:00] total time: 276.0387s [20:36:02] the upgrade was probably the jessie dist-upgrade [20:36:05] ok, it's CPU related [20:36:07] that we've been doing on various jessies [20:36:08] I am almost sure now [20:36:24] as in the CPU is broken? [20:36:24] but no overheat? [20:36:25] there was that rcu_sched hint during shutdown earlier, too [20:36:39] bblack: Yeah, I did the upgrade since I saw it happening on the other jessies. 
[20:37:14] twentyafterfour: seems no, temperature was mentioned as ok earlier [20:37:23] usually something like that rcu_sched stall detect means something's horribly broken with a kernel, or a hardware issue, but again it seems unlikely in either case [20:37:41] I 've downgraded libc6 just in case btw [20:37:45] no change [20:37:52] oh wait [20:37:55] yeah I don't think it could be userland... [20:38:08] hmm it's probably shared anyway right ? [20:38:18] rcu_sched stall detect could also just be extremely overworked cpus on iowait or real processing somehow [20:38:36] so random data points to throw you off course... [20:38:45] Um, I think I did upgrade instead of dist-upgrade. linux-meta-4.4 is still held back... Would that be it? [20:38:50] (combined with the other upgrades...) [20:38:55] well a new sysbench would be using the new libc6 anyway and it is as painfully slow as before [20:40:02] while taking a peek at sniffed network traffic to look at the DoS possibilities or whatever, I saw this: [20:40:05] https://phabricator.wikimedia.org/P4175 [20:40:19] but just once in isolation, but it's kinda odd, and odd that I saw it in looking briefly at all [20:40:40] SIP ? [20:40:41] why on earth some SIP protocol scanner hits phab is beyond me, or how it's relevant [20:40:50] aaah I know [20:40:54] err gerrit [20:40:57] looking for misconfigured SIP servers [20:41:00] http://blog.sipvicious.org/ [20:41:03] and spam [20:41:14] ok [20:41:14] Well gerrit is indeed a misconfigured SIP server. 
[20:41:16] ;-) [20:41:40] sysbench is surely suspectible to general load [20:41:48] I don't think those kinds of results would be valid unless gerrit is also down [20:42:51] if the machine can be demonstrated to have real issue with gerrit offline, that's a whole other thing, yeah [20:43:10] someone has apt-get upgrade open, sitting on the prompt I guess [20:43:18] it's locking looking at package states, etc [20:43:29] thats me sorry [20:43:42] killed [20:44:43] bblack: we can kill gerrit, I doubt it should create a big difference in sysbench output [20:44:47] kernel bugs seem out because it was initially up on an outdated 4.4, and now on the old 3.16 [20:45:02] it at least removes a variable if we're trying to find a non-gerrit fault in the system [20:45:12] btw bast4001 is supposedly older and slowers CPUs [20:45:26] kill gerrit and apache and rerun sysbench [20:45:33] ok doing so [20:46:24] both killed [20:46:24] perf interrupt took too long (2571 > 2500), lowering kernel.perf_event_max_sample_rate to 50000 [20:46:46] that's relatively-normal [20:46:46] rerunning sysbench [20:46:51] 1 thread [20:47:03] same test as on bast4001 [20:47:17] which is btw a Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz box [20:47:21] 6 of them [20:47:33] lead is 16 of Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz [20:47:52] I am not even going into cores/processor discussion here [20:47:57] 11 secs vs 250 is too much [20:48:41] bblack: maybe a microcode update ? [20:48:55] as in.. could it help ? [20:49:00] what's the results we're looking at? [20:49:23] well I started bast4001 like 2 mins ago and it ended 11 secs later [20:49:31] akosiaris@bast4001:~$ sysbench --test=cpu run [20:49:35] total time: 11.3094s [20:49:39] per-request statistics: [20:49:39] min: 1.12ms [20:49:39] avg: 1.13ms [20:49:39] max: 2.82ms [20:49:40] approx. 
95 percentile: 1.13ms [20:49:44] still waiting on lead [20:49:44] I don't run it often, I'm just not sure what it means [20:50:02] it's just doing arithmetics to find prime number up to 10000 [20:50:10] so practically integer divisions ? [20:50:23] ok [20:50:24] In this mode each request consists in calculation of prime numbers [20:50:24] up to a value specified by the --cpu-max-primes option. All calculations are performed using 64-bit integers. [20:50:33] so yes 64-bit integer divisions [20:50:40] just making sure it's not for some reason doing far more work on one host than the other [20:51:02] try it with 16 threads? [20:51:05] it gives a total time and average per-req things, but not an "I did this many things" [20:51:12] PROBLEM - HTTPS on lead is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:51:12] PROBLEM - SSH access on lead is CRITICAL: Connection refused [20:51:21] I guess the per-req stats should be comparable regardless [20:51:31] total number of events: 10000 [20:51:48] Test execution summary: [20:51:48] total time: 276.3568s [20:51:48] total number of events: 10000 [20:51:48] total time taken by event execution: 276.3330 [20:51:48] per-request statistics: [20:51:49] min: 27.40ms [20:51:50] avg: 27.63ms [20:51:51] max: 29.39ms [20:51:53] approx. 
95 percentile: 27.73ms [20:51:53] 276 secs [20:51:58] ok [20:52:07] with a 27msec average [20:52:22] for integer arithmetic [20:52:24] that's clearly wrong [20:52:27] that's a pretty major issue [20:52:32] PROBLEM - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [20:52:40] yeah [20:52:49] it also lines up with that rcu_sched stall message earlier [20:52:53] I 'd say let's find another box to restore the service [20:53:03] +1 [20:53:03] Yeah, I was just going to suggest that [20:53:05] since that's based on time to complete and expected sequence of short operations, basically [20:53:24] "restore" what's in the gerrit repos right now, or some backup that's a day old? [20:53:32] it could be chaos if we really have to used a backdated backup [20:53:33] Let's do what's there now [20:53:36] So we don't lose today's work [20:53:51] yes.. I have no idea how much time rsync will take [20:53:56] So yeah, let's provision a host, apply the roles, rsync the git data over, and we should be good to go [20:53:56] but at least no data loss [20:53:58] Oh and DNS [20:54:11] let's see what we have in spares [20:54:19] there might be some subtle issues with swapping things out [20:54:25] in dmesg i see how it sets CPU PERF_BIAS to 'normal' and "was 'performance'", fwiw [20:54:33] we should have up to date mirrored data on phab [20:54:40] right up to the point of failure [20:54:45] Yep [20:54:53] Last one at 7* something [20:54:55] akosiaris: db1019 should be in spares [20:54:58] lead's IP might be hardcoded somewhere or other in ferm rules or who knows what [20:55:02] not sure if already decomm'ed [20:55:10] bblack: It's just in the role class. [20:55:11] and also, gerrit's public IP is in a row-specific subnet [20:55:12] Should be it [20:55:26] so spare needs to be same row, if you want to keep the IP [20:55:34] Actually, it's in the hiera. 
[20:55:38] So we could swap if need be [20:55:44] (Even better if we don't have to tho) [20:55:46] so, row C [20:56:29] we could change IPs too, I just worry there will be a long tail of discovering things caring about that [20:56:39] we will need hasharAway for zuul/jenkins when we restore gerrit services [20:57:06] No we won't, we stopped hardcoding that IP [20:57:10] Er, hostname [20:57:14] It's just gerrit.wm.o [20:57:15] regarding lead, is the theory that there is something physically wrong with the processor, or some hideous software bug brought about by the package upgrade? [20:57:40] Blah, firewall for zuul [20:57:41] Needs it [20:57:56] wmf4182 is on the same row [20:58:01] so no need for IP change [20:58:06] ori: I don't think we have a strong indication of which... sounds like hardware failure since akosiaris downgraded libc. sysbench shouldn't depend on any of the other packages [20:58:15] but I may be wrong entirely here [20:58:38] there's also systemd... [20:58:45] hey [20:58:52] :-/ [20:58:59] this is a cpu frequency scaling issue of some kind - think cpu power managemenent, etc... [20:59:02] would it be worth it to disable individual cores [20:59:09] I was thinking cpu overheat [20:59:10] the cpu cores are all running at like 200mhz right now [20:59:25] in case the problem is local to some core? [20:59:26] you can see it in the curf column of atop [20:59:28] all cores [20:59:42] nod [20:59:56] run powertop ? [21:00:04] it gives a lot of detail about cpu idle states [21:00:04] Dereckson: Dear anthropoid, the time has come. Please deploy Create olo.wikipedia.org (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T2100). [21:00:05] akosiaris@lead:/sys/bus/cpu/devices/cpu0/cpufreq$ cat scaling_cur_freq [21:00:05] 197167 [21:00:11] root@lead:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq [21:00:12] ok.. 
so that's low [21:00:14] 185253 [21:00:21] yeah they're all running at ~200Mhz [21:00:22] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) [21:00:28] ok that explains it [21:00:31] question is why [21:00:41] and maybe we can fix it fast now that we know what it is [21:00:45] retro-chic [21:00:49] hahah [21:00:56] I didn't know intel cpus would scale that low [21:01:04] it's like the turbo button on old 8086s :-) [21:01:10] hahahahah [21:01:18] moritzm: that just brought back some memories [21:01:23] lemme reboot the box and log into the BIOS [21:01:28] maybe I 'll find something there [21:01:29] you guys already know what spare system you are taking? [21:01:30] http://askubuntu.com/questions/523640/how-i-can-disable-cpu-frequency-scaling-and-set-the-system-to-performance [21:01:34] or need to know which to take? [21:01:36] reading backscroll, sorry for chiming in with a joke [21:01:39] fixed more than one PC issue back in the day by pushing ye ol' turbo button [21:01:46] would it be normal that they are set to "performance" profile? [21:01:50] or "normal"? [21:01:50] these boxes have 3 settings in the bios [21:01:54] performance [21:02:10] robh: We're looking at wmf4182 since it's in the same row and we can avoid the IP address change [21:02:10] this kinf of thing is pretty kernel-sensitive too [21:02:11] robh: looks like maybe on track to solving this one instead of needing a spare [21:02:13] akosiaris: [ 0.523751] ENERGY_PERF_BIAS: Set to 'normal', was 'performance' [21:02:21] !log rebooting lead one more time [21:02:25] we should probably reboot back to 4.4 (maybe install the newest 4.4 we're doing everywhere else too) [21:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:28] ... 
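[Editor's note: a minimal sketch of the check done by hand above, assuming a Linux host that exposes cpufreq under /sys/devices/system/cpu. The 800 MHz threshold and the names slow_cores and read_cur_freqs are illustrative assumptions, not anything that was run during the incident. scaling_cur_freq is reported in kHz, so the 197167 quoted above is roughly 197 MHz.]

```python
from pathlib import Path

# Assumed threshold for "suspiciously slow": 800 MHz, expressed in kHz.
LOW_KHZ = 800_000


def slow_cores(readings, threshold_khz=LOW_KHZ):
    """Return {cpu_name: kHz} for every core running below the threshold."""
    return {cpu: khz for cpu, khz in readings.items() if khz < threshold_khz}


def read_cur_freqs(root="/sys/devices/system/cpu"):
    """Read scaling_cur_freq (kHz) for each cpuN directory that exposes it."""
    readings = {}
    for f in Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        # f is .../cpuN/cpufreq/scaling_cur_freq, so cpuN is two levels up.
        readings[f.parent.parent.name] = int(f.read_text().strip())
    return readings
```

[Calling slow_cores(read_cur_freqs()) on a healthy box would be expected to return an empty dict; on lead during the outage it would presumably have listed every core.]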
[21:02:29] bblack: yes [21:02:36] ostriches: thats an out of warranty spare slated to be pulled out of the rack [21:02:37] it doesn't have the new 4.4 yet [21:02:57] you just need a row c spare? [21:02:58] we're ignoring iridium? [21:03:03] apergos: for now [21:03:04] apergos: yes [21:03:10] robh: Row c is best, yeah [21:03:10] entering BIOS [21:03:13] ok [21:03:49] apergos: iridium has phd disabled - don't wanna be mirroring from gerrit, trying to eliminate phab as a source of the problem [21:04:04] polonium is in row c and spare (in warranty) [21:04:06] Yeah. It was causing load but not causing the problem, clearly [21:04:11] ACKNOWLEDGEMENT - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) daniel_zahn currently stopped on purpose [21:04:30] i dunno the memory amount in current (bad) system [21:04:39] thanks twenty afterfour [21:04:43] the bad system is "lead" [21:04:47] Performance Per Watt (DAPC) [21:04:48] twentyafterfour one problem though when gerrit comes back on, phabricator will probaly most likly take 1-6 hours to git clone and update the repos [21:04:53] due to the new phab update that [21:04:54] Performance Per Watt (OS) <= current [21:04:59] Performance [21:05:03] paladox: that's ok [21:05:04] Dense Configuration [21:05:05] does that when git clone fails [21:05:10] Custom [21:05:10] akosiaris: I think PPW (OS) is our usual correct setting, right? [21:05:19] not sure anymore [21:05:21] robh: 32g I think [21:05:22] robh: ^ ? [21:05:33] oh, the polonium is half that, not good enough [21:05:42] Yeah def need the ram for the jvm [21:05:47] Most important thing [21:05:48] more likely something's wrong inside the host with the dist-upgrade. we could have an issue with cpufrequtils package or some settings somewhere related? [21:05:53] wait it's got 16 cpus and only 32g ram? 
[21:05:55] PROBLEM - Host lead is DOWN: PING CRITICAL - Packet loss = 100% [21:06:12] robh: do you know what our usual bios setting is for cpu power management? [21:06:43] I am gonna switch to performance and test [21:06:47] performance per watt (os) is the one that has the least issues and we use iirc [21:06:51] ok [21:06:56] well, here's an issue ... [21:07:18] yeah but as far as we know this issue started with the box online, nobody's been messing in bios settings [21:07:36] I still think some package update -> some bad setting with linux-level cpu power management -> etc [21:07:46] Yes, nobody was doing any maintenance when it started. [21:07:49] this starts to sound like we'll end up replacing the mainboard [21:07:55] (either that or it is some kind of hardware fault) [21:07:56] Last maintenance was yesterdays (possibly botched?) dist-upgrade [21:08:00] ok, spare system WMF4725 has 32gb memory and duwl Intel® Xeon® Processor E5- 2623 V3 (3ghz/4core) [21:08:13] if a spare is needed, thats in row c [21:08:35] just file a hw-task after the fact stating you took it, so and such, etc... [21:08:49] akosiaris: ? [21:08:59] lemme see if i can find a free element (save you guys the trouble) [21:09:04] sounds fine to me [21:09:16] I am booting 4.4 on lead with performance btw [21:09:36] akosiaris: PerfOptimized? [21:09:48] RECOVERY - HTTPS on lead is OK: SSL OK - Certificate gerrit.wikimedia.org valid until 2016-12-10 04:23:00 +0000 (expires in 64 days) [21:09:55] RECOVERY - Host lead is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [21:09:56] volans: it was called performance... now what the well that is... [21:10:20] Please tell me the help text said "Make things fast" [21:10:23] isn't the dist-upgrade rather a red herring? 
according to https://etherpad.wikimedia.org/p/gerrit-outage-20161006 problems started at 17:49 today, but the bigger upgrade took part yesterday at 19:55 [21:10:32] Test execution summary: [21:10:33] total time: 10.3632s [21:10:33] total number of events: 10000 [21:10:33] total time taken by event execution: 10.3623 [21:10:33] per-request statistics: [21:10:34] min: 0.99ms [21:10:35] avg: 1.04ms [21:10:36] max: 2.51ms [21:10:37] ok [21:10:39] ostriches: well.. almost [21:10:40] I'm looking at a DELL PDF for BIOS.SysProfileSettings.SysProfile, surely different version though [21:10:44] moritzm: Well it's either that, or an actual hardware fault at this point. [21:10:44] so we are probably back in action [21:10:47] and it's all bening updates we've had on a lot of hosts across the cluster by now [21:10:49] bblack: you rule btw [21:10:52] good [21:11:06] cobalt is now a free hostname. so if you guys need a new box for lead replacement, use WMF4725 (name it cobalt) [21:11:07] RECOVERY - gerrit process on lead is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [21:11:16] So it was the botched dist-upgrade? [21:11:18] no [21:11:20] its also in row c. [21:11:21] performance in the bios [21:11:21] well, the fact that flipping to Performance in BIOS "fixes" it doesn't really tell us the underlying problem [21:11:25] but maybe it's good enough for now [21:11:31] org.eclipse.jetty.server.Server : Started @50479ms [21:11:32] other machines don't need that to perform fine [21:11:40] yes, let's bring the service online for now [21:11:46] and me go to sleep [21:11:49] ok [21:11:54] and let's figure out the rest another time [21:11:58] so gerrit is up and running [21:12:04] ostriches: wanna enable apache ? [21:12:38] RECOVERY - SSH access on lead is OK: SSH OK - GerritCodeReview_2.12.3 (SSHD-CORE-0.14.0) (protocol 2.0) [21:12:38] fast startup in logs this time? 
[21:12:38] yes [21:12:38] !log lead: enabling & running puppet again, should bring things back up [21:12:38] it was mentioned earlier that this server had an upgrade but not the kernel upgrade, while other servers had both at the same time [21:12:54] yeah, I still think something went off with various updates [21:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:11] we should probably bring it into a full state of proper dist-upgrade + latest-4.4, like we've done on several others recently [21:13:18] but let's just stabilize for now and look at that later [21:13:32] unbelievable (just finished the backread) [21:14:05] the entire box is now where it should be [21:14:13] heavy IOwait reading all the repos and caching stuff [21:14:19] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:14:40] "This is what your modern Java code's performance looks like when we simulate running it on a 200Mhz Pentium MMX from 1997" [21:14:42] the ENERGY_PERF_BIAS line is gone from dmesg too [21:14:50] rotfl bblack [21:14:53] git pull worked fine [21:14:55] bahahahaha [21:15:01] lol [21:15:34] yaaay, thanks everyone :D [21:15:36] wtmff [21:16:01] should have copied the dmesg from before the last reboot [21:16:10] it will still be there, no? [21:16:20] check the logs [21:16:22] kernel.log [21:16:47] yes, syslog [21:16:56] and messages [21:17:03] and kern.log [21:17:12] funny thing is this is there only after the reboots [21:17:23] I don't find anything close to 17:4X [21:17:26] Could someone restart zuul please to pick up gerrits online? 
[21:17:31] a yes [21:17:37] One sec, I got it [21:17:44] cool [21:18:05] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:18:11] thanks [21:18:46] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:19:22] RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=48. [21:19:39] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:20:06] Bleh, zuul... [21:20:22] * Zuul Merger: /etc/default/zuul-merger is not set to START_DAEMON=1: exiting [21:20:25] What do I do? [21:20:25] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [21:20:48] well, the CPUs definitely slowed down at that 17:4x timeframe [21:21:04] you can see in the main syslog: [21:21:06] Oct 6 17:23:06 lead puppet-agent[23425]: Sleeping for 6 seconds (splay is enabled) [21:21:14] ^ the loglines from that puppet run have "normal" timing [21:21:21] (seconds apart) [21:21:26] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:21:27] ostriches: did you use /etc/init.d/zuul restart ? [21:21:37] I did! [21:21:37] :) [21:21:43] the next puppet run looks odd/slow [21:21:50] Er, sbin/service. [21:21:54] Which is what I have sudo to [21:21:58] Oct 6 17:57:53 lead puppet-agent[27983]: Caching catalog for lead.wikimedia.org [21:22:01] Oct 6 17:58:50 lead puppet-agent[27983]: Applying configuration version '1475776442' [21:22:04] Oct 6 18:03:33 lead puppet-agent[27983]: (/Stage[main]/Gerrit::Proxy/Letsencrypt::Cert::Integrated[gerrit]/Exec[acme-setup-acme-gerrit]/returns) executed successfully [21:22:10] minutes [21:22:20] ostriches zuul still not working [21:22:22] Last reconfigured: Thu Oct 06 2016 14:50:21 GMT+0100 (GMT Summer Time) [21:22:25] I'm aware. 
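[Editor's note: the timing comparison above (seconds between puppet log lines on a normal run, minutes on the slow one) can be sketched as below. This is only an illustration: gaps_seconds is a hypothetical helper, the year is assumed because classic syslog timestamps omit it, and month parsing with %b assumes an English/C locale.]

```python
from datetime import datetime


def gaps_seconds(lines, year=2016):
    """Seconds elapsed between consecutive syslog lines ('Oct 6 17:57:53 host ...')."""
    stamps = []
    for line in lines:
        # The first three whitespace-separated fields are month, day, HH:MM:SS.
        month, day, clock = line.split()[:3]
        stamps.append(
            datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
```

[Fed the three puppet-agent lines quoted above (17:57:53, 17:58:50, 18:03:33), this gives gaps of 57 and 283 seconds.]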
[21:22:26] and the previous run for those entries? [21:22:34] tries it [21:22:41] !log restarting zuul on gallium [21:22:45] clearly around there, there was also some conflict/racing happening between a very very slow and still-running-forever puppet agent and some manual debugging of the problem [21:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:10] * Restarting Zuul Server zuul ... waiting for jobs to complete root@gallium:~# [21:23:19] Works now [21:23:21] thanks [21:23:27] https://integration.wikimedia.org/zuul/ [21:23:42] Oct 6 19:04:43 lead kernel: [16258451.121387] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.005 msecs [21:23:50] ^ other random signs of super-slowness [21:24:01] could someone restart phab/phd? [21:24:11] Let's hold off on that for a bit [21:24:17] so the CPU slowdown was definitely there at the start of the problem time, and not before. and it wasn't just some artifact after we rebooted either [21:24:17] I want to watch load a little more closely [21:25:02] bblack: so I would exclude a "dormient" bios setting that got enabled at reboot [21:25:12] right [21:25:18] the bios change is just a workaround/hack [21:25:19] any entry in the apt log in between the good run and the bad one? [21:25:36] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:25:47] or any cron job even (something that might have picked up a previous apt change) [21:26:02] Worth a bios reflash or update? [21:26:02] just normal things like debian sa stuff for cron [21:26:09] and apt history is empty since almost a day before the problem [21:26:11] meh [21:26:58] could still be a sort of hardware issue that needs a box replacement or board replacement, etc [21:27:02] According https://integration.wikimedia.org/zuul/ Zuul works fine now. 
[21:27:09] :-) [21:27:11] maybe just a very small hardware issue that makes dynamic cpu frequency management not work :) [21:27:12] IIRC the Dells have an internal eventlog for hardware errors, not sure if we can access that from the running system? [21:27:24] nothing theer [21:27:25] this has hardware error written all over it [21:27:26] there* [21:27:29] moritzm: I think akosiaris already looked at SEL and drac logs and found nothing [21:27:32] ok [21:27:35] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:27:39] looked at SEL and DRAC twice [21:27:45] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:27:45] nada in drac? oh that's annoying [21:27:55] even cross checked DRAC's time to be sure [21:28:28] If we're thinking it could be hardware, maybe we should go ahead and move off before it totally goes kaput? [21:28:32] guess we'll ask Dell, they'll want DRAC logs, then they'll send a new board [21:28:46] they'll want some sort of diagnostic crapola too I bet [21:29:05] but that's down time, moving off sounds better and better [21:29:40] yeah tomorrow [21:29:42] ostriches: et al: just FYI, I'm going to email engineering@ and wikitech-l@ that it was down, a courtesy notice really [21:29:47] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:29:54] with working CPUs and all that, we can take a planned downtime and rsync off the repos properly and move to another hardware [21:30:00] if that still makes sense tomorrow [21:30:00] yeah not instantly [21:30:03] greg-g: ty [21:30:06] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:30:13] unless it falls over again in the meantime [21:30:29] I suspect the current state will be stable [21:30:34] We could start 
provisioning [21:30:48] yea, let's still put the role on that spare box too [21:30:53] I think something minor failed in the drac/motherboard crap that broke linux->dellstuff->cpu power/freq management [21:31:00] Get the OS and element assigned [21:31:02] and forcing it to "performance" in bios works around all of that complexity [21:31:39] acpi yuckiness [21:32:14] it's telling that the drop to 200Mhz happened at runtime under 4.4 (as evidence by timings in syslog, etc), and then when we actually saw it we had already rebooted to 3.19 [21:32:36] so it survives reboots and large changes in kernel version, and no related syslogs or dmesgs, probably hardware-induced [21:32:43] 3.16 to be precise, but yeah, agreed [21:32:46] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [21:32:53] sure does rule out kernel bug as likely [21:33:26] the apt questions were in case of some bizarre configuration change someplace. but clearly not that either [21:33:38] I strongly suspected that too, but the timing just doesn't add up [21:33:48] hardware smell indeed [21:34:06] this is one of those problems where all of one's usual intuitions are wrong [21:34:23] you can say that again [21:34:26] :-D [21:34:30] In a few moments, if all is fine in Zuul/Gerrit, I'll start olo.wikipedia.org add wiki process. [21:34:38] some java process is running way too slow... probably the *last* thing you think is "oh maybe the CPUs themselves are running too slow" [21:34:44] hahahahahaha [21:34:46] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:34:53] Gerrit is down -> it's Gerrit's fault! [21:34:55] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [21:34:57] Was my first thought ;-) [21:34:59] ostriches: you OK with Dereckson proceeding with the wiki creation? 
[21:35:06] well you'd be right 99 times out of 100 [21:35:10] this was the 100th is all [21:35:24] ostriches: I need a functional Gerrit/Zuul for that, as there will be some changes to merge. [21:35:31] Zuul should be fine now [21:35:44] But does thcipriani want to try and catch up group1? [21:36:29] * apergos wanders off to do nighttime things... good luck all [21:36:42] apergos: g'nite [21:36:57] ostriches: ugh, right, that first [21:37:03] I wanted to get group1 and 2 up to date today but I missed my window entirely :( [21:37:09] we have time [21:37:11] I am off to bed too, bye everyone [21:37:15] bye! [21:37:18] akosiaris: g'nite. Thanks!! [21:37:29] sorry Dereckson, let's get the train back up to speed first [21:37:36] thcipriani: as olo is for group2, that's better for me too [21:37:47] * greg-g nods [21:37:52] ok, let's see how this goes... [21:37:58] greg-g: yes sure, train has priority [21:38:58] robh: Which system did you suggest in row c again? I'm gonna go ahead and file the task [21:39:25] ostriches WMF4725 [21:39:32] ostriches: cool, put #hw-requests on it and since its not an emergency we'll approve normally [21:39:34] WMF4725 [21:39:57] normally being just waiting for tomorrow [21:40:10] legoktm: AaronSchulz cherry picking https://gerrit.wikimedia.org/r/#/c/314461/ to wmf.21, FYI [21:40:19] (last time, promise) [21:40:27] thcipriani: alright [21:40:30] i would undo that topic lock if i knew how. [21:40:37] its not the topiclock setting with chanserv. [21:40:46] doesn't bother me, I have +op :) [21:40:53] yeah but it bugs me! [21:40:55] mode -t [21:41:17] (PS1) Dzahn: add IPs for cobalt, using WMF4725 [dns] - https://gerrit.wikimedia.org/r/314601 [21:41:31] ^ adding wmf4725 as "cobalt" like robh said to do [21:41:37] mutante: nono [21:41:45] i thought we were waiting for it to be approved? [21:41:49] since its not an emergency? 
[21:41:55] thcipriani: so what's with https://gerrit.wikimedia.org/r/#/c/311206/ - I understand due to gerrit being on strike it wasn't merged [21:42:14] i thought the consensus was to set that up in case it's a hardware issue and it goes down again [21:42:22] and to rsync data to it? [21:42:37] is that what is happening? [21:43:11] SMalyshev: ah, right, SWAT. I could probably get that out while I'm waiting on zuul to merge the backport needed for wmf.21 if you're game for that. [21:43:14] Well I wanted to get tasks filed, but we can get it done tomorrow :) [21:43:23] thcipriani: sure [21:43:23] i thought we were just filing a task for normal approvals. if its being setup now ok, but if it ends up being not used then the onsite has to waste time wiping [21:43:34] just trying to avoid adding to onsite work load [21:43:35] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: Smalyshev) [21:43:59] but either way, a task should be filed so all the patchsets can reference it [21:44:04] (since we arent in outage condition) [21:44:04] (Merged) jenkins-bot: Add config for units on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: Smalyshev) [21:44:39] no outage = no reason not to follow the server lifecycle document. [21:44:48] SMalyshev: live on mw1099, check please [21:45:16] I think history of cruft has proven that not following it and referencing tasks leads to orphaned shit. =] [21:45:23] thcipriani: ok give me a min [21:45:34] robh a gerrit outage could happen at any time, since setting bios to performance should not be necessary. [21:45:54] paladox: yet ostriches just said he planned to file tasks and this could happen after that [21:46:02] so why not just file the task so we have a history? 
[21:46:29] (PS1) Eevans: Add time-window compaction strategy jar to classpath [puppet] - https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) [21:46:41] Yeh we should file one for history [21:47:51] T147596 [21:47:52] T147596: Allocate spare misc box in eqiad for gerrit replacement - https://phabricator.wikimedia.org/T147596 [21:47:54] thcipriani: seems to be working just fine [21:48:38] (CR) Muehlenhoff: [C: 2] Remove ca.patch, now obsolete [debs/openssl] - https://gerrit.wikimedia.org/r/314566 (owner: Muehlenhoff) [21:48:50] SMalyshev: ok, will go live everywhere. I'm going to sync unitConversionConfig.json and then Wikibase-production.php is that correct? [21:48:59] (CR) Muehlenhoff: [C: 2] Update cloudflare patch for 1.0.2i [debs/openssl] - https://gerrit.wikimedia.org/r/314567 (owner: Muehlenhoff) [21:49:15] thcipriani: yes, json should be there by the time php config is in effect [21:49:25] okie doke, doing [21:49:39] mutante: reference T147597 on your dns change [21:49:39] T147597: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597 [21:49:42] its the setup task for cobalt [21:49:54] i'll snag the network port right now and set it up [21:49:59] public vlan [21:50:05] yep, i am, just waiting for the save [21:50:16] (PS2) Dzahn: add IPs for cobalt, using WMF4725 [dns] - https://gerrit.wikimedia.org/r/314601 (https://phabricator.wikimedia.org/T147596) [21:50:17] there it is [21:50:23] hw-request just has the request, then a sub-task is for the actual work [21:50:28] so T147597 is better than T147596 =] [21:50:38] as thats the setup task. 
[21:50:50] ok [21:51:22] (PS2) Eevans: Add time-window compaction strategy jar to classpath [puppet] - https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) [21:51:27] (PS1) Ori.livneh: AbuseFilter: Use new parser from I4aea5f00 on Labs [mediawiki-config] - https://gerrit.wikimedia.org/r/314604 [21:51:28] !log thcipriani@tin Synchronized wmf-config/unitConversionConfig.json: SWAT: [[gerrit:311206|Add config for units on Wikidata (T117032)]] PART I (duration: 00m 48s) [21:51:29] T117032: Create configuration for specifying units conversions - https://phabricator.wikimedia.org/T117032 [21:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:51:45] (PS3) Dzahn: add IPs for cobalt, using WMF4725 [dns] - https://gerrit.wikimedia.org/r/314601 (https://phabricator.wikimedia.org/T147597) [21:51:49] I don't wanna futz with puppet & shiz on the new box today, but we can at least get the OS installed and the stuff prepped :) [21:52:03] We'll plan an official downtime tomorrow for that [21:52:10] ostriches: i figured we can start an rsync [21:52:43] thcipriani: could you let me know when you're done deploying? [21:52:49] ori: sure [21:52:52] tnx [21:52:53] well, or restore of the bacula backup [21:52:58] to the new box [21:53:06] !log thcipriani@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:311206|Add config for units on Wikidata (T117032)]] PART II (duration: 00m 50s) [21:53:06] Then rsync the difference tomorrow when we migrate. [21:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:53:12] ^ SMalyshev live everywhere [21:53:26] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:53:58] ori: if you have something you want to get out quickly you can do so now. I'm going to be fiddling with rolling out the train for a bit after now(ish). 
[21:54:05] thcipriani: thanks! [21:54:10] cool, thanks -- and yes, I do -- labs only patch [21:54:32] ok, lemme know when it's clean and I'll go back to train fiddling [21:54:40] clear even [21:54:54] (CR) RobH: [C: 1] add IPs for cobalt, using WMF4725 [dns] - https://gerrit.wikimedia.org/r/314601 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn) [21:55:05] (CR) Ori.livneh: [C: 2] AbuseFilter: Use new parser from I4aea5f00 on Labs [mediawiki-config] - https://gerrit.wikimedia.org/r/314604 (owner: Ori.livneh) [21:55:10] network port is done [21:55:29] mutante: i imagine you are already on the next steps, feel free to claim task [21:55:37] (Merged) jenkins-bot: AbuseFilter: Use new parser from I4aea5f00 on Labs [mediawiki-config] - https://gerrit.wikimedia.org/r/314604 (owner: Ori.livneh) [21:57:16] ostriches should i update https://wikitech.wikimedia.org/wiki/Lead to say migration to a new server cobalt [21:57:22] ? [21:58:12] robh: ok [21:58:15] PROBLEM - HHVM jobrunner on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:58:33] (CR) Dzahn: [C: 2] add IPs for cobalt, using WMF4725 [dns] - https://gerrit.wikimedia.org/r/314601 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn) [22:00:59] !log ori@tin Synchronized wmf-config/abusefilter.php: If794eb2a: AbuseFilter: Use new parser from I4aea5f00 on Labs (duration: 00m 49s) [22:01:03] thcipriani: done [22:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:01:15] ori: thank you [22:03:22] robh: raid10 in the ticket but i think it's raid1 [22:03:39] 4 * 4TB sata [22:03:41] I've updated both https://wikitech.wikimedia.org/wiki/Lead and https://wikitech.wikimedia.org/wiki/Cobalt [22:03:44] may as well use the disks [22:04:16] paladox: i didnt realize anyone actually made pages for servers anymore ;] [22:04:29] oh [22:04:29] robh: but there is no partman recipe raid10-lvm, but lead uses raid1-lvm [22:04:30] if 
its not auto maintained, it seems like its always going to be suspect [22:04:40] !log thcipriani@tin Synchronized php-1.28.0-wmf.21/includes/libs/rdbms/loadbalancer/LoadBalancer.php: [[gerrit:314600|Ignore reuseConnection() errors after LoadBalancer/LBFactory destruction (T147520)]] (duration: 00m 50s) [22:04:40] i was only doing it since there was a gerrit one [22:04:41] T147520: Warning: Destructor threw an object exception: exception 'InvalidArgumentException' with message 'LoadBalancer::reuseConnection: connection not found, has the connection been freed already?' in /srv/mediawiki/php-1.28.0-wmf.21/includes/libs/rdbms/loadbala - https://phabricator.wikimedia.org/T147520 [22:04:42] mutante: i know but lead is 2 * 500GB is why [22:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:02] there isnt a raid10-lvm-ext4-srv eh? [22:05:05] maybe we should maek one [22:05:30] i know using lvm and srv is, ext4 is always better over xfs for larger disks [22:05:37] so seems easier to just use ext4 entirely [22:05:48] i was not planning to write a new partman recipe for this one spare box [22:05:56] its not just one box, but i can write it =] [22:06:10] we have lots of 4 * 4tb that have been used with lesser partman recipes cuz no one wants to write them, ehh [22:06:23] lemme slap one together right now =] [22:06:27] alright [22:06:57] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.21 [22:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:07] so time to re-enable phd? [22:07:29] mutante: i think raid10-gpt-srv-lvm-ext4 would work even when not GPT required disks, but checking [22:07:38] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 19 processes with UID = 997 (phd) [22:07:49] mutante: wanna try it? 
[22:07:50] Operations, ops-eqiad: investigate spare ex4500 serial number - https://phabricator.wikimedia.org/T147590#2697592 (RobH) [22:07:52] Operations, ops-eqiad: investigate spare ex4500 serial number - https://phabricator.wikimedia.org/T147590#2697577 (RobH) [22:08:01] Operations, ops-codfw: update/audit serial of EX4300-spare2-codfw - https://phabricator.wikimedia.org/T147592#2697610 (RobH) [22:08:03] robh: ok [22:08:09] !log phd enabled on iridium [22:08:15] if it doesnt then we'll slap another one together, but it would be good to know [22:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:08:23] and it would simplify our large selection of recipes [22:09:11] yes, that would be nice, we have a lot of them and wanted to reduce anyways [22:09:28] i'll try it [22:09:29] i've been slowly removing old ones as we've decommissioned the varying hardware types [22:09:47] Operations, Gerrit, hardware-requests: Allocate spare misc box in eqiad for gerrit replacement - https://phabricator.wikimedia.org/T147596#2697736 (demon) [22:09:50] Operations, Gerrit, hardware-requests: Allocate spare misc box in eqiad for gerrit replacement - https://phabricator.wikimedia.org/T147596#2697752 (demon) [22:09:52] yep, i also deleted 2 or 3 that were not used anymore [22:09:59] awesome [22:10:01] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (RobH) [22:10:12] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697796 (RobH) [22:10:24] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (RobH) [22:10:34] Operations, Gerrit, hardware-requests: Allocate spare misc box in eqiad for gerrit replacement - 
https://phabricator.wikimedia.org/T147596#2697814 (RobH) a:mark Since this was potentially an outage condition, I recommended wmf4725 (hostname to be cobalt) be allocated for this. We've already star... [22:10:39] Operations, Gerrit, hardware-requests: Allocate spare misc box in eqiad for gerrit replacement - https://phabricator.wikimedia.org/T147596#2697736 (RobH) [22:10:54] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697832 (Dzahn) a:Dzahn [22:10:57] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (Dzahn) [radon:~] $ host cobalt.mgmt.eqiad.wmnet cobalt.mgmt.eqiad.wmnet has address 10.65.2.127 [radon:~] $ host cobalt.wikimedia.org cobalt.wikimedia.or... [22:10:59] Ahh, phd catching up [22:11:07] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (Dzahn) [22:12:24] (Draft1) Paladox: Gerrit: Update error.html message to include channel #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/314608 [22:12:26] (Draft2) Paladox: Gerrit: Update error.html message to include channel #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/314608 [22:15:21] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:15:39] (PS1) Dzahn: add cobalt to DHCP, set partman recipe [puppet] - https://gerrit.wikimedia.org/r/314609 (https://phabricator.wikimedia.org/T147597) [22:18:19] blerg, starting to see this in the logs: Variable 'wgCirrusSearchBypassPerUserFailure' is not set [22:18:43] wmf.21 maintenance/getConfiguration.php:105 [22:18:58] (PS2) Dzahn: add cobalt to DHCP, set partman recipe [puppet] - https://gerrit.wikimedia.org/r/314609 (https://phabricator.wikimedia.org/T147597) [22:19:04] ebernhardson ^^ [22:19:13] (CR) Dzahn: [C: 2] add cobalt to DHCP, set partman recipe [puppet] - https://gerrit.wikimedia.org/r/314609 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn) [22:21:32] It's a variable defined in default CirrusSearch extension entry point: "Allow failures of the per-user Pool Counter to continue through. This still runs the error callbacks to trigger logging of failures, but does not prevent the search from running. Used to tune the per-user pool counter settings before enabling it fully and blocking queries." [22:21:38] not present in wmf-config [22:22:09] yeah, couldn't find in the wmf-config at all :(( [22:22:13] so it seems the maintenance/getConfiguration.php expects CirrusSearch extension, but doesn't have it [22:25:32] task https://phabricator.wikimedia.org/T147601 [22:25:49] could be a sync issue? Only 11 error and now down at 10 [22:26:09] or caching [22:29:35] mutante, greg-g: Preference on window for tomorrow? [22:29:45] I'm figuring early-ish, get it done. [22:29:52] ostriches: stained glass [22:30:05] lol thx [22:30:17] ostriches: early is good, but i need it to be after about 8.40 [22:30:22] school [22:30:23] wikibugs seems to not have re joined #wikimedia-editing [22:30:40] * wikibugs has quit (Excess Flood) [22:30:50] and -devtools [22:30:50] hrm, well, another one hit right after I filed the task. 
Would be bad to have fatals in production methinks :\
[22:31:04] mutante: How's 10am our time?
[22:31:07] no one from search here?
[22:31:11] ostriches: pretty good, yea
[22:31:15] ostriches: don't care really, 'tis Friday
[22:31:15] greg-g?
[22:31:20] k, 10am it is.
[22:32:06] ah, https://phabricator.wikimedia.org/T147601#2697906
[22:32:14] that makes sense
[22:32:33] (PS1) Dzahn: add cobalt site.pp, comment gerrit role, access for admins [puppet] - https://gerrit.wikimedia.org/r/314612 (https://phabricator.wikimedia.org/T147597)
[22:33:10] ostriches: ^ i'm setting it up so you can ssh to it, but without the gerrit role on it
[22:33:19] and then we'll go from there tomorrow
[22:33:23] thcipriani: if you still plan to deploy to group2 too, patch isn't useful
[22:33:32] mutante: Sounds good, thx
[22:33:39] Dereckson: yeah, I'll reply
[22:34:49] I'll close that task when I've rolled forward and verified.
[22:34:59] Which I'm going to do now, I didn't see any other weirdness.
[22:36:33] ^ was me making the patch
[22:36:48] which makes grrrit-wm sad
[22:38:01] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.21
[22:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:38:56] mutante: I'm curious if a takeaway from this is that we should have a warm slave we can fail over to.
[22:39:37] ostriches: yes, i think it is. i thought that too
[22:39:45] Keep it provisioned, rsync the data every so often so swapping is just rsync the delta and swap dns
[22:39:50] yes, that
[22:40:01] maybe in the other DC ?
[22:40:33] Yeah probably best idea
[22:40:43] +1
[22:41:02] we should do that with contint1001 once it's ready, as well
[22:41:38] the non-element implies there should be 2001, yep
[22:41:43] btw, thcipriani, if you need to go long/cut off SWAT, so be it
[22:41:50] mutante: /me nods
[22:41:52] (PS2) Dzahn: add cobalt site.pp, comment gerrit role, access for gerrit-roots [puppet] - https://gerrit.wikimedia.org/r/314612 (https://phabricator.wikimedia.org/T147597)
[22:42:00] greg-g: I don't think I will at this point
[22:42:11] wmf.21 is everywhere, just monitoring
[22:42:21] could be gerrit2001
[22:42:27] LOL
[22:42:31] +1
[22:42:46] hell yeah
[22:42:51] wait, needs to be generic, codereview2001 :p
[22:43:04] Well we have phab.....
[22:43:11] even genericker: dev2001
[22:43:11] iridium and phab2001
[22:43:12] lol
[22:43:21] thcipriani: ahh, good :)
[22:43:30] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:43:38] jvmshit2001
[22:43:55] lol
[22:43:57] MHz201
[22:44:00] lifestooshortforjava2001
[22:44:09] lotsofmemoryfordoingnothing2001
[22:45:55] OK, I'll call it deployed. Nothing looks too different since the deploy.
[22:46:08] (PS3) Dzahn: add cobalt site.pp, comment gerrit role, access for gerrit-roots [puppet] - https://gerrit.wikimedia.org/r/314612 (https://phabricator.wikimedia.org/T147597)
[22:46:45] Dereckson: you're clear (sorry I took up most of your window :()
[22:48:06] What we could do is start the SWAT now (it's a short one of 3 config patches) and olo afterwards.
[22:48:48] jdlrobson: ping?
[22:49:00] hi Dereckson
[22:49:05] robh: unfortunately it doesn't boot after install :/
[22:49:05] Hello.
[22:49:17] not sure yet if that's due to partman
[22:49:32] mdadm: No devices listed in conf file were found.
[22:49:36] i'm fine with doing swat early provided the train has rolled out to English Wikipedia (my swats are dependent on that being true)
[22:49:39] looks like it is
[22:49:42] Push footer version 2 to stable isn't a config patch
[22:50:18] jdlrobson: so 313906 is blocked by 313426?
[22:50:26] Dereckson, I've added https://gerrit.wikimedia.org/r/#/c/313166/ to SWAT
[22:50:26] mutante: damn well, that sounds like they didn't spin up in time
[22:50:32] i'd try soft reboot and see?
[22:50:43] if not then yeah, let's see about making a new one...
[22:50:48] Dereckson: that's correct. I need to do one swat at a time starting with the footer
[22:50:57] jdlrobson: https://gerrit.wikimedia.org/r/#/c/313426/ is already included in wmf21
[22:51:11] and thcipriani deployed wmf21 everywhere
[22:51:15] Dereckson: \o/
[22:51:26] then yep if you want to start swatting by all means i'm game :)
[22:51:58] (CR) Dereckson: [C: 1] "parent patch https://gerrit.wikimedia.org/r/#/c/313426/ is included in wmf21, deployed for group0-group2, so it's fine now." [mediawiki-config] - https://gerrit.wikimedia.org/r/313906 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[22:52:09] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/313906 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[22:52:31] jdlrobson: oh there is also https://gerrit.wikimedia.org/r/#/c/313898
[22:52:49] "Footer code is riding the train so this is still -1ed until Thursday."
[22:53:03] Dereckson: yep that's good to go now
[22:53:18] (CR) Dereckson: [C: 1] "Train has caught up."
[mediawiki-config] - https://gerrit.wikimedia.org/r/313898 (https://phabricator.wikimedia.org/T145442) (owner: Jdlrobson)
[22:53:19] i'll remove my -1
[22:53:23] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/313898 (https://phabricator.wikimedia.org/T145442) (owner: Jdlrobson)
[22:53:26] (CR) Jdlrobson: [C: 1] "choo choo!" [mediawiki-config] - https://gerrit.wikimedia.org/r/313898 (https://phabricator.wikimedia.org/T145442) (owner: Jdlrobson)
[22:53:35] (PS2) Dereckson: Push footer version 2 to stable [mediawiki-config] - https://gerrit.wikimedia.org/r/313898 (https://phabricator.wikimedia.org/T145442) (owner: Jdlrobson)
[22:53:41] (CR) jenkins-bot: [V: -1] Enable RelatedArticles on Minerva skin for all but top 6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/313906 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[22:53:43] (CR) Dereckson: [C: 2] "SWAT, take two" [mediawiki-config] - https://gerrit.wikimedia.org/r/313898 (https://phabricator.wikimedia.org/T145442) (owner: Jdlrobson)
[22:53:57] MaxSem: ack'ed
[22:53:57] Dereckson, we should discuss wikitech-static & flow tomorrow
[22:54:02] Krenair: ok
[22:54:08] (Merged) jenkins-bot: Push footer version 2 to stable [mediawiki-config] - https://gerrit.wikimedia.org/r/313898 (https://phabricator.wikimedia.org/T145442) (owner: Jdlrobson)
[22:54:29] Operations: Deploy a freenode server - https://phabricator.wikimedia.org/T82958#2698001 (Dzahn)
[22:54:34] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[22:55:36] (PS1) Chad: Gerrit: Provide ability to specify server as slave [puppet] - https://gerrit.wikimedia.org/r/314623
[22:55:47] (CR) Dzahn: [C: 2] add cobalt site.pp, comment gerrit role, access for gerrit-roots [puppet] - https://gerrit.wikimedia.org/r/314612 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn)
[22:56:21] jdlrobson: 313898
[22:56:24] Push footer version 2 to stable live on mw1099
[22:59:54] Dereckson: w00t
[23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T2300). Please do the needful.
[23:00:04] Jdlrobson and Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:12] way ahead of you jouncebot
[23:00:45] Dereckson: looking good!
[23:00:49] (CR) Dereckson: "You can remove 2016-10-01, 2016-10-03 and 2016-10-04 rules too."
[mediawiki-config] - https://gerrit.wikimedia.org/r/313166 (owner: MaxSem)
[23:00:50] please sync :)
[23:02:37] (PS2) Dereckson: Enable RelatedArticles on Minerva skin for all but top 6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/313906 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[23:02:48] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable footer v2 on Minerva for all wikis (T145442) (duration: 00m 50s)
[23:02:49] T145442: Move footer changes to stable - https://phabricator.wikimedia.org/T145442
[23:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:03:04] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/313906 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[23:03:35] (Merged) jenkins-bot: Enable RelatedArticles on Minerva skin for all but top 6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/313906 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[23:03:58] jdlrobson: 313906 'RelatedArticles on Minerva skin' live on mw1099
[23:04:19] Dereckson: the footer change hasn't synced yet
[23:04:24] it's really important that gets synced before this one
[23:04:26] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2698059 (Dzahn)
[23:04:39] Operations, Gerrit: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (Dzahn)
[23:05:35] (PS3) MaxSem: throttle: remove expired exceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/313166
[23:05:44] Dereckson, ^
[23:05:48] :)
[23:05:55] Dereckson: can you confirm it's synced?
[23:06:02] yes, it is
[23:06:09] 23:02:48 Synchronized wmf-config/InitialiseSettings.php: Enable footer v2 on Minerva for all wikis (T145442) (duration: 00m 50s)
[23:06:32] try with ?debug=true at the end of the URL
[23:06:32] (CR) Alex Monk: [C: 1] Add $use_ssl switch to role::labs::novaproxy [puppet] - https://gerrit.wikimedia.org/r/314441 (owner: Andrew Bogott)
[23:07:03] if it works, you just need to wait 5+ minutes for caching to expire, or we need to purge something
[23:07:10] okay great yep seeing it now - cached pages
[23:07:12] i can test this change now
[23:07:58] (PS4) Dereckson: throttle: remove expired exceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/313166 (owner: MaxSem)
[23:08:20] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/313166 (owner: MaxSem)
[23:08:52] (Merged) jenkins-bot: throttle: remove expired exceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/313166 (owner: MaxSem)
[23:09:35] Dereckson: this one is not working so i'm debugging a little
[23:09:38] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[23:09:41] jdlrobson: ok
[23:10:03] Dereckson: I have a last-minute patch if that's OK
[23:10:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:10:20] https://gerrit.wikimedia.org/r/#/c/314626/
[23:10:22] MaxSem: throttle cleaning live on mw1099 — no error in the logs
[23:10:49] RoanKattouw: ok
[23:10:49] Dereckson, WFM too
[23:11:04] Dereckson: sorry..
i need to do a follow up
[23:11:20] jdlrobson: no problem, we don't have anything for IS, so you aren't blocking other changes
[23:11:43] !log dereckson@tin Synchronized wmf-config/throttle.php: Clean expired throttle rules ([[Gerrit:313166]]) (duration: 00m 50s)
[23:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:12:32] er actually I see logos for pt.wikimedia *has* an IS config
[23:13:28] thanks, Dereckson!
[23:13:37] You're welcome.
[23:13:41] (PS1) Jdlrobson: Merge config declarations [mediawiki-config] - https://gerrit.wikimedia.org/r/314627 (https://phabricator.wikimedia.org/T144812)
[23:13:50] Dereckson: ^
[23:13:52] silly me
[23:14:58] by the way, we could set wg directly; your extensions use extension registration, so the wg = wmg hack isn't needed anymore
[23:15:45] (but of course if mobile CS says $wgRelatedArticlesFooterBlacklistedSkins = $wmgRelatedArticlesFooterBlacklistedSkins, we got null as value in the previous iteration)
[23:16:06] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/314627 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[23:16:09] (PS1) Chad: Gerrit: Copy public IPs from lead to cobalt, we're reusing them [puppet] - https://gerrit.wikimedia.org/r/314628 (https://phabricator.wikimedia.org/T147597)
[23:16:15] (PS2) Dereckson: Merge config declarations [mediawiki-config] - https://gerrit.wikimedia.org/r/314627 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[23:16:21] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/314627 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[23:16:47] (Merged) jenkins-bot: Merge config declarations [mediawiki-config] - https://gerrit.wikimedia.org/r/314627 (https://phabricator.wikimedia.org/T144812) (owner: Jdlrobson)
[23:16:50] jdlrobson: I'll clean wg/wmg for mobile extensions from the configuration next
week, that will save you such trouble.
[23:17:06] Dereckson: I raised a bug about this last week ironically :) https://phabricator.wikimedia.org/T147234
[23:17:14] feel free to claim it if you want to have a go that would be AMAZING
[23:17:19] oh we already have one
[23:18:58] I've added the tracking bug as parent.
[23:19:23] 314627 live on mw1099
[23:20:21] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:21:51] Dereckson: perfect!
[23:21:53] sync away :)
[23:21:58] thank you for the SWATs!
[23:22:11] I did some mwrepl tests, I confirm setting is propagated as expected.
[23:23:15] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable RelatedArticles on Minerva skin for all but top 6 wikis (T144812) (duration: 00m 50s)
[23:23:16] T144812: Deploy related pages to mobile web stable channel - part 1 - https://phabricator.wikimedia.org/T144812
[23:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:36] RoanKattouw: could you add your change to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161006T2300 ?
[23:25:31] Yup will do
[23:25:40] RoanKattouw: live on mw1099
[23:26:24] (PS2) Chad: Gerrit: provide auto-detection of slave status [puppet] - https://gerrit.wikimedia.org/r/314623
[23:28:18] (PS4) Dereckson: Logo for pt.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314539 (https://phabricator.wikimedia.org/T126832)
[23:28:35] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/314539 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[23:29:01] (Merged) jenkins-bot: Logo for pt.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314539 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[23:29:31] 314539 live on mw1099 too
[23:30:42] Works fine.
[23:31:20] (PS3) Chad: Gerrit: provide auto-detection of slave status [puppet] - https://gerrit.wikimedia.org/r/314623
[23:31:34] !log dereckson@tin Synchronized static/images/project-logos/: Logo for pt.wikimedia (T126832, 1/2) (duration: 00m 50s)
[23:31:35] T126832: create a wiki for Wikimedia Portugal - https://phabricator.wikimedia.org/T126832
[23:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:38] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Logo for pt.wikimedia (T126832, 2/2, no-op for the moment) (duration: 00m 50s)
[23:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:49] RoanKattouw: you can test it on mw1099?
[23:34:58] Looking
[23:36:36] Dereckson: I think it works
[23:36:43] ok
[23:36:53] In that we have a bug that used to be 100% reproducible and I just failed to reproduce it
[23:36:58] So that doesn't mean much but it means something
[23:37:10] sounds good to me
[23:37:26] reproducible test cases are generally helpful to test a fix
[23:37:54] !log dereckson@tin Synchronized php-1.28.0-wmf.21/includes/Revision.php: Revision->insertOn: Set READ_LATEST flag (T138310) (duration: 00m 49s)
[23:37:55] T138310: Flow as a Beta feature: enable, disable and reenable doesn't seem to work - https://phabricator.wikimedia.org/T138310
[23:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:39:40] (PS4) Chad: Gerrit: provide a way to specify slave mode [puppet] - https://gerrit.wikimedia.org/r/314623
[23:41:04] Okay time for olo.
[23:44:17] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[23:44:47] !log dereckson@tin Synchronized php-1.28.0-wmf.21/extensions/WikimediaMessages/i18n/wikimediaprojectnames: olo.wikipedia.org project name (duration: 00m 51s)
[23:46:06] !log dereckson@tin Synchronized php-1.28.0-wmf.21/extensions/WikimediaMessages/i18n/wikimediainterwikisearchresults/: olo.wikipedia.org project name (duration: 00m 49s)
[23:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:49:29] (PS9) Dereckson: Initial configuration for olo.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[23:49:47] (CR) Dereckson: "PS9: wikiversions.json update" [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[23:51:01] (CR) Paladox: [C: 1] Gerrit: Copy public IPs from lead to cobalt, we're reusing them [puppet] - https://gerrit.wikimedia.org/r/314628 (https://phabricator.wikimedia.org/T147597) (owner: Chad)
[23:51:44] (PS10) Dereckson: Initial configuration for olo.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[23:52:05] (CR) Dereckson: [C: 2] Initial configuration for olo.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[23:52:31] (Merged) jenkins-bot: Initial configuration for olo.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/312807 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[23:59:04] Okay, addWiki is broken
[23:59:13] Catchable fatal error: Argument 2 passed to Database::sourceFile() must be callable, boolean given, called in
/srv/mediawiki/php-1.28.0-wmf.21/extensions/WikimediaMaintenance/addWiki.php on line 89 and defined in /srv/mediawiki/php-1.28.0-wmf.21/includes/libs/rdbms/database/Database.php on line 3040
[23:59:31] (and indeed addWiki passes false as the second argument)
[23:59:44] Wants replacing with null
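
The fatal above comes down to a parameter that accepts either a callable or "nothing", where `false` is neither. What follows is a minimal Python sketch of that rule, offered as an analogy only: the real code is PHP, and `source_file`, `call_with`, the file name and the statements below are hypothetical stand-ins, not the MediaWiki API.

```python
from typing import Callable, Optional


def source_file(filename: str,
                line_callback: Optional[Callable[[str], None]] = None) -> int:
    """Hypothetical stand-in for a method whose 2nd argument is 'callable or null'."""
    # None means "no callback". Anything else must be callable, so a boolean
    # such as False is rejected, much like the strict check in the log.
    if line_callback is not None and not callable(line_callback):
        raise TypeError(
            f"argument 2 must be callable, {type(line_callback).__name__} given"
        )
    # Made-up statements standing in for the contents of `filename`.
    statements = ["CREATE TABLE a;", "CREATE TABLE b;"]
    for statement in statements:
        if line_callback is not None:
            line_callback(statement)
    return len(statements)


def call_with(value) -> str:
    """Call source_file with `value` as the callback and report what happened."""
    try:
        return f"ok ({source_file('tables.sql', value)} statements)"
    except TypeError as exc:
        return f"rejected: {exc}"


print(call_with(False))  # the old call style: a boolean is not a callable
print(call_with(None))   # the suggested fix: pass "no callback" explicitly
```

Passing `None` (null in the PHP case) keeps the "no callback" meaning that `false` was presumably meant to carry, without tripping the type check.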