[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T0000). [00:03:45] PROBLEM - Disk space on ruthenium is CRITICAL: DISK CRITICAL - free space: / 1775 MB (3% inode=90%) [00:04:53] !log ruthenium - apt-get clean gets a little more disk space [00:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:05] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:06:14] !log updating phabricator on iridium [00:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:42] y [00:06:58] subbu: hi, ideas on things we could delete on ruthenium? it seems low on disk [00:07:12] 32G visualdiff [00:07:59] yea, it's almost all /srv/visualdiff/pngs [00:08:19] but since that is on / we need to rotate something [00:10:11] !log ruthenium low on disk space, because /srv/visualdiff/pngs (parsoid-vd-tests) is pretty large and /srv isn't a separate mount [00:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:35] !log Phabricator update completed. [00:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:52] twentyafterfour: :) smooth update it seems. anything interesting that changed in phab? [00:18:24] mutante: search results should be a bit better [00:18:36] I fixed stemming and exact phrase matches [00:19:09] so it'll match agreements when you search for agreement [00:21:43] also git repository polling should be more efficient [00:25:48] twentyafterfour: cool :) thank you [00:43:35] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 569.00 seconds [00:49:35] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:51:45] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 81459.999543 Seconds [00:52:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 81511.605569 Seconds [00:52:35] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 81511.61029 Seconds [00:54:35] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [00:54:35] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [00:54:45] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [00:56:44] (03CR) 10Dzahn: [C: 031] "here is one more for (after) Friday" [puppet] - 10https://gerrit.wikimedia.org/r/337207 (https://phabricator.wikimedia.org/T143349) (owner: 10Dzahn) [01:01:21] (03PS1) 1020after4: Phabricator: Fix up elasticsearch cluster config [puppet] - 10https://gerrit.wikimedia.org/r/345488 [01:01:22] mutante: still around? [01:01:44] (03CR) 10Dzahn: [V: 031 C: 031] "wikimedia.bytemark.co.uk has address 212.110.173.211" [puppet] - 10https://gerrit.wikimedia.org/r/345323 (https://phabricator.wikimedia.org/T159331) (owner: 10Reedy) [01:01:48] twentyafterfour: yes [01:02:09] mind merging a config change for phabricator? [01:02:12] https://gerrit.wikimedia.org/r/#/c/345488/ [01:03:15] (03CR) 1020after4: "This also removes mysql from the config since it's been disabled for a while, no need to keep it defined." [puppet] - 10https://gerrit.wikimedia.org/r/345488 (owner: 1020after4) [01:03:57] ok, one moment [01:04:09] (03CR) 10Dzahn: [C: 031] "https://wikimedia.bytemark.co.uk/" [puppet] - 10https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) (owner: 10Reedy) [01:06:32] so in each data center, read is false but write is true, for itself [01:07:05] eh, for the other one.. 
[01:07:18] no it should be "write: true" for all, and read: true for same data center [01:07:31] so it'll keep all indexes updated and read from the closest one [01:08:37] the main thing is that it's got two clusters with one host instead of one cluster with two hosts [01:09:08] the code treats clusters as separate and hosts as interchangeable for writes....so the old config wouldn't keep both data centers in sync properly [01:10:00] ok, i get it now, just had to stare at the 3 hiera files again [01:10:14] aha @ clusters [01:10:38] (03CR) 10Dzahn: [C: 032] Phabricator: Fix up elasticsearch cluster config [puppet] - 10https://gerrit.wikimedia.org/r/345488 (owner: 1020after4) [01:10:42] the old config is correct if eqiad and codfw were a single es cluster but they aren't [01:10:55] ok, yes [01:10:58] thanks! [01:11:22] merged on master, i'll let you run puppet on server [01:11:26] ok [01:11:37] !log running `puppet agent --test` on iridium [01:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:09] twentyafterfour: i assume you are in the middle of switching it. so the current exception on search form isn't unexpected [01:17:35] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [01:22:30] hi chanel - should I create a phab ticket for phab search being broken? [01:22:47] Unhandled Exception ("PhutilAggregateException") All of the configured Fulltext Search services failed. - PhutilAggregateException: All Fulltext Search hosts failed: - PhutilAggregateException: All Fulltext Search hosts failed: [01:24:35] guess everyone is lurking [01:24:35] https://phabricator.wikimedia.org/T161772 [01:27:59] mutante: it's behaving incorrectly [01:28:06] trying to fix it [01:28:17] twentyafterfour: ok! [01:28:47] user created a ticket, i'm making food but i'll check here [01:29:11] thanks! 
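The role layout discussed between 01:06 and 01:10 (every cluster writable, only the local one readable, and one cluster per data center rather than one cluster holding both hosts) can be sketched as below. This is a minimal Python model under stated assumptions, not Phabricator's code or the actual hiera data: the host names, dictionary keys, and both helper functions are hypothetical.

```python
# Hypothetical model of the layout described in the chat, as seen from eqiad.
# Host names, keys and helpers are illustrative only.
SPLIT = {
    "eqiad": {"hosts": ["search.eqiad.example"], "roles": {"read": True, "write": True}},
    "codfw": {"hosts": ["search.codfw.example"], "roles": {"read": False, "write": True}},
}

# The older shape: a single cluster that lists both data centers as hosts.
SINGLE = {
    "all": {
        "hosts": ["search.eqiad.example", "search.codfw.example"],
        "roles": {"read": True, "write": True},
    },
}


def write_targets(clusters):
    """Pick one host per writable cluster.

    Clusters are treated as separate (each one gets the write), while hosts
    inside a cluster are treated as interchangeable (any single host will do).
    """
    return [
        cluster["hosts"][0]
        for cluster in clusters.values()
        if cluster["roles"]["write"] and cluster["hosts"]
    ]


def read_targets(clusters):
    """List the hosts of every cluster that is marked readable."""
    return [
        host
        for cluster in clusters.values()
        if cluster["roles"]["read"]
        for host in cluster["hosts"]
    ]


if __name__ == "__main__":
    print("split writes: ", write_targets(SPLIT))
    print("split reads:  ", read_targets(SPLIT))
    print("single writes:", write_targets(SINGLE))
```

In this model the split layout sends each write to both data centers and reads only from the local one, while the single-cluster layout writes to just one of the two hosts, which matches the sync concern raised at 01:09:08.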
[01:29:45] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83649.020591 Seconds [01:29:45] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83739.956605 Seconds [01:29:45] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83653.760098 Seconds [01:30:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83791.462606 Seconds [01:30:35] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83791.476055 Seconds [01:31:55] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83779.205899 Seconds [01:33:35] (03PS1) 1020after4: Turn on read: true to both clusters for now [puppet] - 10https://gerrit.wikimedia.org/r/345491 [01:33:45] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:35:18] (03PS2) 1020after4: Turn on read: true to both clusters for now [puppet] - 10https://gerrit.wikimedia.org/r/345491 [01:36:45] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 84073.680358 Seconds [01:36:58] (03PS3) 1020after4: Turn on read: true to both clusters for now [puppet] - 10https://gerrit.wikimedia.org/r/345491 [01:37:45] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [01:38:19] (03PS4) 1020after4: Turn on read: true to both clusters for now [puppet] - 10https://gerrit.wikimedia.org/r/345491 (https://phabricator.wikimedia.org/T161772) [01:40:21] mutante: ^ when you get a chance [01:40:45] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 84399.866128 Seconds [01:41:38] (03CR) 10Dzahn: [C: 032] Turn on read: true to both clusters for now [puppet] - 10https://gerrit.wikimedia.org/r/345491 (https://phabricator.wikimedia.org/T161772) (owner: 1020after4) [01:41:59] twentyafterfour: go ahead with puppet [01:42:23] thanks [01:43:50] 
mutante: works! :) [01:43:58] search works for me [01:44:03] was about to say :) ok, cool [01:44:24] now I just need to kill that bogus exception but it'll be good like it is for now [01:44:35] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) [01:44:37] ok, great [01:44:42] eh, except that i guess [01:44:43] uhm.. [01:46:55] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:47:02] weird, [01:47:06] it says it's running but it's not [01:47:21] " sudo service phd status [01:47:23] phd start/running [01:47:25] " [01:48:11] stop it fully and start it again, vs restart? [01:48:23] did it get restarted by config change? [01:48:25] tried that [01:48:45] it got restarted by me manually restarting it (just to be sure it picked up everything) [01:48:45] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [01:48:54] * twentyafterfour should have logged that [01:49:28] is manual restart the same that puppet does ? [01:49:36] should be [01:49:38] maybe a pid file owned by wrong user now [01:49:43] looks [01:49:54] Daemon 586105 STDE [Thu, 30 Mar 2017 01:42:24 +0000] Caught signal 15 (SIGTERM). [01:49:55] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 84859.322478 Seconds [01:49:56] Daemon 586105 FAIL [Thu, 30 Mar 2017 01:42:24 +0000] Process exited with error 143. [01:50:06] sigh [01:50:20] oh wait I see something ... [01:50:47] should have seen that happen.. ariel's law [01:52:27] !log phd fixed on iridium. 
libphutil was out of sync with phd source [01:52:30] finds random pastes by others like https://secure.phabricator.com/P1729 [01:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:35] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 21 processes with UID = 997 (phd) [01:52:41] heh [01:52:44] :) [01:53:08] thanks mutante, sorry for the alerts, that one was my fault [01:54:15] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:54:32] no problem. looks we are good then. i'll step out then. cu later twentyafterfour [01:54:45] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 85155.223919 Seconds [01:54:59] if you need an emergency merge, SMS to number in office wiki list [01:55:07] thanks! have a good evening [01:55:14] thanks, bye [01:56:35] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 37.877426 Seconds [01:56:35] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 37.875977 Seconds [01:56:45] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 49.074035 Seconds [01:56:45] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 50.325248 Seconds [01:56:45] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 52.234984 Seconds [01:56:56] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 56.221606 Seconds [01:57:15] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [02:22:59] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.17) (duration: 07m 20s) [02:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:05] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:57:42] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 13m 43s) [02:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:22] it doesn't say , master anymore :( [03:03:05] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [03:03:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Mar 30 03:03:31 UTC 2017 (duration 5m 49s) [03:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:23:45] RECOVERY - Disk space on ruthenium is OK: DISK OK [03:25:19] mutante, parsoid logs in /srv/log/parsoid tend to grow big. i deleted them for now ... but, it'll start filling up again on the next test run. let us chat tomorrow about limiting log sizes there. [04:22:24] (03Abandoned) 10Subramanya Sastry: Allow parsoid-vd-client service to be controlled outside systemd [puppet] - 10https://gerrit.wikimedia.org/r/344961 (owner: 10Subramanya Sastry) [04:38:34] (03PS2) 10Felipe L. Ewald: Add Bytemark to public_mirrors.html list [puppet] - 10https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) (owner: 10Reedy) [04:51:25] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:00:55] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:17:05] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 29 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [05:19:25] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [05:29:55] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [05:32:05] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 15 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [05:43:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345495 [05:43:42] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345495 [05:47:37] 06Operations, 10Monitoring: Add slabinfo prometheus exporter - https://phabricator.wikimedia.org/T160071#3142788 (10ema) [05:51:25] !log upgrading twisted to 16.2.0 on lvs100[456] (eqiad secondaries) T160433 [05:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:32] T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433 [05:53:36] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345495 (owner: 10Marostegui) [05:54:50] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345495 (owner: 10Marostegui) [05:55:01] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345495 (owner: 10Marostegui) [05:56:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 - T17441 (duration: 00m 45s) [05:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:56:13] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [05:56:43] !log Deploy schema change on db2014 - codfw master (this will generate lag on codfw) - T73563 [05:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:49] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563 [06:04:50] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142794 (10Marostegui) 05Resolved>03Open This has happened again, so maybe the BBU is indeed faulty. ``` root@db1048:~# date ; mysql --skip-ssl -e "show slave status\G" | grep... [06:06:45] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142797 (10Marostegui) a:05Marostegui>03Cmjohnson [06:10:12] 06Operations, 10Pybal, 10Traffic, 13Patch-For-Review: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. 
- https://phabricator.wikimedia.org/T119372#3142801 (10ema) [06:11:05] 06Operations, 10Pybal, 10Traffic: Make PyBal respect advertised BGP capabilities - https://phabricator.wikimedia.org/T81305#3142806 (10ema) [06:11:27] 06Operations, 10Pybal, 10Traffic: Add pybal check to ensure service IP is bound - https://phabricator.wikimedia.org/T79730#3142807 (10ema) [06:25:41] !log Logging backwards for the record: restart mysql on db1047 for maintenance - T160454 [06:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:49] T160454: Change length of userAgent column on EL tables - https://phabricator.wikimedia.org/T160454 [06:27:35] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [06:28:59] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142822 (10Marostegui) ``` ˜/icinga-wm 8:27> RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds ``` ``` Battery State: Optimal ``` `... [06:32:15] (03PS1) 10Muehlenhoff: Remove a number of obsolete conditionals from mediawiki classes [puppet] - 10https://gerrit.wikimedia.org/r/345502 [06:45:01] !log installing apparmor security updates on trusty [06:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:25] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[percona-toolkit] [06:58:00] (03CR) 10Elukey: [C: 031] Remove a number of obsolete conditionals from mediawiki classes [puppet] - 10https://gerrit.wikimedia.org/r/345502 (owner: 10Muehlenhoff) [07:00:45] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[apparmor] [07:03:10] (03PS1) 10Muehlenhoff: Remove obsolete mediawiki::packages::legacy class [puppet] - 10https://gerrit.wikimedia.org/r/345505 [07:05:13] !log upgrading twisted to 16.2.0 on lvs100[123] (eqiad primaries) T160433 [07:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:20] T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433 [07:10:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [07:12:20] brief 5xx spike in esams text ^ [07:12:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [07:18:05] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [07:18:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:19:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:24:47] 06Operations, 10Pybal, 10Traffic: Unhandled pybal ValueError: need more than 1 value to unpack - https://phabricator.wikimedia.org/T143078#3142889 (10ema) This problem should be [[https://github.com/twisted/twisted/commit/942b63cc04fba83dabf1958b3ed24af860778681|solved upstream]]. I've just finished upgradin... 
[07:25:25] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:28:45] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [07:29:59] (03PS1) 10Muehlenhoff: Add some more fine-grained debdeploy server groups for openstack [puppet] - 10https://gerrit.wikimedia.org/r/345506 [07:31:25] (03CR) 10Muehlenhoff: [C: 032] Add some more fine-grained debdeploy server groups for openstack [puppet] - 10https://gerrit.wikimedia.org/r/345506 (owner: 10Muehlenhoff) [07:39:57] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3142970 (10Gehel) After multiple tests, generating CPU, memory and IO load on elastic2021, the server has not crashed. Those tests are the same... [07:41:40] !log pull elastic2021 back into active duty - T149006 [07:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:47] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [07:43:57] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2021.codfw.wmnet [07:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:20] (03CR) 10Elukey: [C: 031] Remove obsolete mediawiki::packages::legacy class [puppet] - 10https://gerrit.wikimedia.org/r/345505 (owner: 10Muehlenhoff) [07:48:57] 06Operations, 10Pybal, 10Traffic: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433#3142982 (10ema) 05Open>03Resolved [07:57:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345507 (https://phabricator.wikimedia.org/T17441) [07:57:46] (03PS2) 10Elukey: Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 
(https://phabricator.wikimedia.org/T161597) [07:58:37] (03CR) 10Elukey: "Thanks @BBlack, should be ok now!" [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) (owner: 10Elukey) [08:00:53] 06Operations, 06Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3142987 (10Nemo_bis) > I don't think we've been aware of the uselang hack or its mechanics before The documentatio... [08:01:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345507 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [08:02:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345507 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [08:03:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345507 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [08:03:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 - T17441 (duration: 00m 44s) [08:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:56] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [08:13:21] !log Convert UNIQUE keys to PK on db1090 (s2) - T17441 [08:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:27] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [08:18:05] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [08:19:01] (03PS3) 10Giuseppe Lavagetto: Use internal url for Ores, move to ProductionServices.php 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/316317 [08:19:03] (03PS1) 10Giuseppe Lavagetto: Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 [08:19:05] (03PS1) 10Giuseppe Lavagetto: Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 [08:19:07] (03PS1) 10Giuseppe Lavagetto: Use discovery url for Ores as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345510 [08:23:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't think having case-sensitive regexes when we have case-insensitive equality is the way to go:" [software/cumin] - 10https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730) (owner: 10Volans) [08:25:05] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [08:34:38] hashar: o/ - I saw your comment in https://github.com/phpredis/phpredis/issues/562 and I wanted to ask what would it take to upgrade phpredis [08:34:55] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.073 second response time [08:34:56] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 73873 bytes in 0.289 second response time [08:35:08] elukey: hello [08:35:15] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.028 second response time [08:35:39] elukey: so I think you/someone or a task has lead me to that issue [08:35:50] and i merely explicitly listed the versions being used [08:35:53] yes the redis timeouts :) [08:36:30] in Zend world that would be a php5-redis.deb package or something that we would need to bump [08:36:46] in HHVM I have no clue. 
Maybe it uses the Zend extension [08:37:46] yes yes I was reading https://www.mediawiki.org/wiki/Redis [08:38:24] !log repooling mw1261 to reproduce hhvm deadlock with higher debug level [08:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:04] thanks hashar :) [08:40:22] elukey: well I am not sure how helpful I am here :( [08:41:15] hhvm apparently implement redis in plain PHP via hphp/system/php/redis/*.php [08:42:08] I am checking it, and if we have the same quit issue in there too [08:43:11] depends on whether the jobrunner service daemon/loop uses zend or hhvm I guess [08:44:19] /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/etc/jobrunner/jobrunner.conf [08:44:24] $ readlink -f /usr/bin/php [08:44:24] /usr/bin/hhvm [08:44:44] elukey: so most probably the github issue is for the Zend extension, and we would have to look at hhvm php code [08:44:59] https://gerrit.wikimedia.org/r/#/c/85003/1/includes/clientpool/RedisConnectionPool.php :) [08:45:59] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3143079 (10jcrespo) I believe there was yesterday maintenance or trouble on Phabricator. I would ask RelEng first. [08:46:44] ahh I miss ori :( [08:47:48] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3143084 (10Marostegui) Yep, the deployment page said there was a phabricator update so maybe that put more stress on the server and made the BBU fail (again)? Because the fact tha... [08:48:11] we all miss him :) [08:49:05] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:52:03] (03CR) 10Jcrespo: [C: 031] "Let's go for this and let's revisit next week." 
[puppet] - 10https://gerrit.wikimedia.org/r/345372 (owner: 10Marostegui) [08:52:56] elukey: then hhvm open the connection with $conn = fsockopen($host, $port, $errno, $errstr, $timeout); and close it with fclose($conn); [08:53:07] no idea how that is relevant though :( [08:55:14] <_joe_> elukey: still trying to resolve the "cannot connect" redis problem as it's client-side? [08:59:14] (03CR) 10Volans: "> I don't think having case-sensitive regexes when we have" [software/cumin] - 10https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730) (owner: 10Volans) [09:01:54] (03PS3) 10Volans: Uniform maintenance message and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) [09:03:33] (03CR) 10Marostegui: [C: 031] Uniform maintenance message and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [09:04:05] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:04:05] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:04:25] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:04:31] elukey: not me :) [09:04:36] ahahha [09:04:46] this one is the deadlock that Moritz is investigating [09:04:58] depooling mw1261 [09:05:45] !log depooling mw1261 (hhvm-dump-debug in /tmp/hhvm.98736.bt.) 
[09:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:15] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet [09:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:30] forgot the --quiet :P [09:06:34] moritzm: --^ [09:07:06] _joe_ I am really annoyed but all that spam related to timeouts in logstash :) [09:08:01] hashar: I thought https://github.com/wikimedia/mediawiki/blob/wmf/1.29.0-wmf.18/includes/libs/redis/RedisConnectionPool.php#L393 was the closing part.. [09:08:23] <_joe_> elukey: the problem is the shitton of lua code that we execute repeatedly and that blocks redis [09:08:32] <_joe_> that's the reason of the connection errors [09:09:01] <_joe_> we can either 1) accept to have a larger timeout at least from jobrunners 2) retry 3) add moar instances [09:09:02] _joe_ okok I know but those TCP RSTs are due to the QUIT commands [09:09:16] I wanted to open a hhvm issue on gh [09:09:37] and I am +1 for a larger jobrunner timeout for redis [09:09:41] but I have no clue how to do it :) [09:09:54] <_joe_> why is a TCP RST a bad thing when you close a connection? or do you think that's creating issues in the connectionpool? 
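The QUIT-related resets mentioned at 09:09:02 concern the order of operations at close time: if a client sends QUIT and closes the socket straight away, the server's reply arrives at a socket that is already closed. A minimal Python sketch of the alternative ordering, reading the reply before closing, is shown here. It is an illustration only, not the HHVM or MediaWiki code, and the function name `quit_and_close` is hypothetical.

```python
import socket


def quit_and_close(conn: socket.socket) -> bytes:
    """Send QUIT, wait for the server's reply, then close the socket.

    Reading the reply first means no unread data is pending and no reply is
    still in flight when the socket is closed.
    """
    conn.sendall(b"QUIT\r\n")
    reply = conn.recv(64)  # a Redis server answers QUIT with "+OK\r\n"
    conn.close()
    return reply


if __name__ == "__main__":
    # A connected socket pair stands in for the client/server connection.
    client, server = socket.socketpair()
    server.sendall(b"+OK\r\n")  # reply queued up front to keep the demo single-threaded
    print(quit_and_close(client))
    print(server.recv(64))
    server.close()
```

Skipping QUIT entirely and just closing the connection is the other option raised in the discussion.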
[09:12:07] well afaics the client (hhvm on jobrunners) send a QUIT to redis and then closes the socket, then redis reply with "OK" and gets a RST [09:12:16] not really a big deal ok but definitely not clean [09:12:26] phpredis fixed the issue avoiding the QUIT [09:12:34] (03CR) 10Jcrespo: [C: 031] Uniform maintenance message and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [09:13:58] elukey: thanks, that was much quicker this time, having a look [09:15:42] hashar: opened https://github.com/facebook/hhvm/issues/7757 :) [09:17:05] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:18:45] RECOVERY - Check systemd state on netmon1001 is OK: OK - running: The system is fully operational [09:22:45] PROBLEM - puppet last run on dbproxy1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:22:55] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:25:30] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3143129 (10elukey) Opened https://github.com/facebook/hhvm/issues/7757 for the TCP RSTs. [09:25:35] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:25:55] PROBLEM - HHVM processes on mw1261 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [09:26:26] ^set downtime, debugging things [09:27:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "OK, but couple it with the change in line 69 about ruby 1.8. 
From what I see in If31535d7092d4bb9167fe838253d17a47542349e they were anyway" [puppet] - 10https://gerrit.wikimedia.org/r/345366 (owner: 10Dzahn) [09:27:43] (03CR) 10Volans: [C: 032] Uniform maintenance message and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [09:28:51] (03Merged) 10jenkins-bot: Uniform maintenance message and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [09:29:00] (03CR) 10jenkins-bot: Uniform maintenance message and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [09:32:35] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:34:23] !log root@tin Synchronized wmf-config/db-codfw.php: Uniform maintenance message and indentation (duration: 00m 44s) [09:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:28] !log root@tin Synchronized wmf-config/db-eqiad.php: Uniform maintenance message and indentation (duration: 00m 47s) [09:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:37] jynus: AFAICT the root@ is just scap failing to detect the proper user, but just to be sure do you know a quick way to check that everything is fine on tin? [09:50:00] volans, did you merge something as root? [09:50:04] no [09:50:10] and neither run it as root [09:50:29] su - volans -c scap sync-file... [09:50:32] what issue are you talking about? [09:50:42] "log root@tin Synch...." 
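The `root@tin` attribution in this exchange comes down to how a process looks up "who is running me". A small Python sketch contrasting the two usual lookups follows. It is not scap's code, and the helper names `login_user` and `effective_user` are hypothetical. `os.getlogin()` reports the user logged in on the controlling terminal, so after an `ssh` as root followed by `su - someuser -c ...` it can still report root, whereas the effective UID reflects the account the process actually runs as.

```python
import os
import pwd


def login_user():
    """Return the login name tied to the controlling terminal, or None.

    os.getlogin() raises OSError when there is no controlling terminal,
    which is common under cron or non-interactive automation.
    """
    try:
        return os.getlogin()
    except OSError:
        return None


def effective_user():
    """Return the name of the user the process is actually running as."""
    uid = os.geteuid()
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # UID without a passwd entry: fall back to the numeric id.
        return str(uid)


if __name__ == "__main__":
    print("login session user:", login_user())
    print("effective user:    ", effective_user())
```

Comparing the two values on the deploy host is one quick way to confirm that only the label in the log message is off, not the account that ran the command.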
[09:50:55] ah, uid issues [09:50:57] the log message from scap says root@, because using os.getlogin() it failed to detect the proper user [09:51:20] but, just to be on the safe side, I would like to ensure that scap was not run as root [09:51:45] RECOVERY - puppet last run on dbproxy1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:51:52] like if scap do something based on os.getlogin() it might have used the wrong user for something [09:51:55] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:52:01] check the git files [09:52:11] you can technically scap as root [09:52:14] those are ok, also because I merged it manually as my user [09:52:27] then I do not see many issue [09:52:38] only the scap is run through cumin that ssh as root and then do su - volans -c scap... [09:52:49] but talk to scap devels, maybe they can fix the user detection [09:53:02] or if they see some issue with that [09:53:06] I would assume no [09:53:15] ok, thanks [09:53:39] in the past, the only complain was not being able to merge [09:53:48] but that should already have an icinga check [09:54:12] ok, thanks, also getlogin is only used for the logging in the code, so we should be ok [10:01:35] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational [10:01:55] RECOVERY - HHVM processes on mw1261 is OK: PROCS OK: 6 processes with command name hhvm [10:01:56] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.393 second response time [10:02:06] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 74017 bytes in 3.717 second response time [10:02:15] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.048 second response time [10:02:35] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 
43 seconds ago with 0 failures [10:03:19] !log repooling mw1261 for additional test [10:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:22] (03CR) 10StudiesWorld: [C: 031] "It seems appropriate given the bug is resolved and it disables it correctly." [puppet] - 10https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) (owner: 10Dzahn) [10:21:05] (03CR) 10Alexandros Kosiaris: [C: 032] "This will probably require more iterations as we switch authz models (kubernetes is still actively developing them), but looks fine for no" [puppet] - 10https://gerrit.wikimedia.org/r/345187 (owner: 10Dduvall) [10:21:10] (03PS3) 10Alexandros Kosiaris: k8s: Accept any given api server authorization mode [puppet] - 10https://gerrit.wikimedia.org/r/345187 (owner: 10Dduvall) [10:21:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] k8s: Accept any given api server authorization mode [puppet] - 10https://gerrit.wikimedia.org/r/345187 (owner: 10Dduvall) [10:25:44] (03CR) 10Hashar: "That is the same as https://gerrit.wikimedia.org/r/#/c/343309/ or am I confusing files somehow? :]" [puppet] - 10https://gerrit.wikimedia.org/r/345505 (owner: 10Muehlenhoff) [10:27:17] (03CR) 10Elukey: "Hi Diego," [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318068 (owner: 10R4q3NWnUx2CEhVyr) [10:30:55] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Granting wmde group access to grafana-admin.wikimedia.org - https://phabricator.wikimedia.org/T161484#3143260 (10MoritzMuehlenhoff) 05Open>03declined @Addshore: I'm sorry, but we have to decline that request: We've discussed this in the TechOps mee... [10:47:05] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3143303 (10fgiunchedi) 1031's port was also a member of labs-instances vlan, removed the port from there and disabled/enabled the port and now 1031 can pxe-boot. 1... 
[10:49:10] (03Abandoned) 10Muehlenhoff: Remove obsolete mediawiki::packages::legacy class [puppet] - 10https://gerrit.wikimedia.org/r/345505 (owner: 10Muehlenhoff) [10:51:54] (03PS2) 10Muehlenhoff: mediawiki: remove Precise class packages::legacy [puppet] - 10https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: 10Hashar) [10:52:42] elukey: so I have a terrible theory. The HHVM redis code close the connection by sending 'QUIT' then immediately closing the socket [10:53:11] elukey: but the redis server might be sending back to client an acknowledgement eg: "OK" [10:53:28] but since the connection got closed, maybe that trigger a TCP RST [10:54:09] the thing I dont get is your wireshark capture shows the RST coming from the client [10:54:32] but that match the phpredis bug [10:55:04] so in the end most probably the HHVM implementation of Redis.close() should wait for an OK after QUIT [10:55:12] eg what phpredis has done [10:56:30] hashar: this is the exact theory that I have :) [10:56:58] yeah so if two of us + phpredis patch have the same conspiracy theory [10:57:00] it must be true! 
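[editor's note] The theory above (the client sends QUIT, closes the socket right away, and the server's late "OK" lands on a closed socket) suggests reading the reply before closing. Below is a minimal Python sketch of that ordering, not the HHVM or phpredis code. The local mock server, the function names, the 50 ms delay and the "+OK" reply framing are illustrative assumptions, used only so the example runs without a real redis instance.

```python
import socket
import threading
import time


def serve_once(listener: socket.socket, delay: float) -> None:
    """Hypothetical stand-in for the server: answer one QUIT with +OK after a delay."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(64)        # read the client's QUIT
        time.sleep(delay)    # the reply goes out slightly later, as in the theory
        try:
            conn.sendall(b"+OK\r\n")
        except OSError:
            pass             # the client went away before the reply could be sent


def quit_and_wait(addr, timeout: float = 1.0) -> bytes:
    """Send QUIT, then read the server's reply before closing the socket."""
    with socket.create_connection(addr, timeout=timeout) as sock:
        sock.sendall(b"QUIT\r\n")
        return sock.recv(64)  # block until the reply arrives (or the timeout fires)


def demo(delay: float = 0.05) -> bytes:
    """Run the mock server on an ephemeral loopback port and do one clean QUIT."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(target=serve_once, args=(listener, delay))
    server.start()
    try:
        return quit_and_wait(listener.getsockname())
    finally:
        server.join()
        listener.close()


if __name__ == "__main__":
    print(demo())
```

Dropping the `sock.recv(64)` call in `quit_and_wait` gives the ordering described in the chat, where the socket is closed before the reply is read. Whether that produces an RST on a given system is what a packet capture would have to confirm.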
[10:57:22] so maybe that close() https://github.com/facebook/hhvm/blob/master/hphp/system/php/redis/Redis.php#L86-L90 [10:57:33] needs an extra function call to read from the server [10:58:13] (03CR) 10Muehlenhoff: [C: 032] mediawiki: remove Precise class packages::legacy [puppet] - 10https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: 10Hashar) [10:58:31] elukey: most probably add a call to processStringResponse() [10:58:35] should returns OK [10:58:53] but I havent read the redis protocol to figure out what payload the server replies with when a client send (str)QUIT [10:59:13] it should be a simple "OK" [10:59:13] going to lunch & [10:59:18] hopefully yeah [10:59:20] phpredis solved it not calling QUIT :) [10:59:26] should be easily testable [10:59:45] with a mock tcp server that handles QUIT and yield after x miliseconcs OK [11:00:04] that should show the RST happening since the socket is closed before 'OK' is sent back [11:00:20] yeah or skip calling quit. But I am not sure how redis server will handle that [11:00:21] ;D [11:00:28] <_joe_> paravoid: did I say I didn't have time to work on the jobrunners, did I? [11:00:43] <_joe_> jouncebot: !next [11:00:48] <_joe_> uhm [11:00:50] <_joe_> !next [11:00:53] jouncebot: refresh [11:00:56] I refreshed my knowledge about deployments. [11:00:57] maybe it died [11:00:58] <_joe_> !next [11:01:00] huh? [11:01:01] jouncebot: next [11:01:01] In 1 hour(s) and 58 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300) [11:01:09] somehow jouncebot lost its state [11:01:16] <_joe_> ok... 
I'm adding a few patches there :) [11:01:39] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::common: extract scap::proxy to a profile [puppet] - 10https://gerrit.wikimedia.org/r/345527 [11:01:41] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::common: convert to profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/345528 [11:01:43] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::common: convert to profiles (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/345529 [11:01:45] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: move role to profile, unify videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/345530 [11:01:47] (03PS1) 10Giuseppe Lavagetto: mediawiki: add mediawiki_active_dc function [puppet] - 10https://gerrit.wikimedia.org/r/345531 [11:01:48] ;] [11:04:36] <_joe_> elukey: would you care to review ^^ ? [11:04:44] <_joe_> I know it's a lot :/ [11:05:20] sure I'll do it! [11:07:28] (03CR) 10Elukey: [C: 031] "Tested and it works fine, thanks a lot! This is definitely a bug." [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/317825 (owner: 10R4q3NWnUx2CEhVyr) [11:27:04] (03PS2) 10ArielGlenn: Update Bytemark wikimedia mirror hostname [puppet] - 10https://gerrit.wikimedia.org/r/345323 (https://phabricator.wikimedia.org/T159331) (owner: 10Reedy) [11:27:39] (03CR) 10Elukey: [C: 031] "LGTM, from https://puppet-compiler.wmflabs.org/5970/mw1280.eqiad.wmnet/ (scap proxy) it seems fine" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345527 (owner: 10Giuseppe Lavagetto) [11:28:35] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:31:31] (03CR) 10Elukey: [C: 031] "LGTM https://puppet-compiler.wmflabs.org/5971/mw1280.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/345528 (owner: 10Giuseppe Lavagetto) [11:35:53] PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1] [11:36:37] (03CR) 10ArielGlenn: [C: 032] Update Bytemark wikimedia mirror hostname [puppet] - 10https://gerrit.wikimedia.org/r/345323 (https://phabricator.wikimedia.org/T159331) (owner: 10Reedy) [11:37:13] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1] [11:37:44] godog: new swift backend hosts? ^^^ [11:42:03] jouncebot, next [11:42:03] In 1 hour(s) and 17 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300) [11:48:44] (03PS1) 10Marostegui: db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) [11:55:40] !log installing jbig2dec security updates [11:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:43] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:02:16] !log installing glibc security updates on trusty [12:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:20] (03PS1) 10Alexandros Kosiaris: Remove system::role {'config-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345536 [12:03:22] (03PS1) 10Alexandros Kosiaris: Remove system::role { 'conftool-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345537 [12:03:24] (03PS1) 10Alexandros Kosiaris: Move and rename system::role{ 'role::docker::builder':} [puppet] - 10https://gerrit.wikimedia.org/r/345538 [12:11:03] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:13:24] (03CR) 10Hashar: [C: 031] [cleanup] Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: 10Urbanecm) [12:13:52] (03CR) 10Hashar: [C: 031] Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: 10Urbanecm) [12:14:44] (03CR) 10Hashar: [C: 031] Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: 10Urbanecm) [12:15:48] (03CR) 10Hashar: [C: 031] Enable Multimedia Viewer at officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: 10Urbanecm) [12:15:58] !log Updated the Constraints table on Wikidata, per T160506. [12:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:05] T160506: Update constraint table - https://phabricator.wikimedia.org/T160506 [12:17:04] (03CR) 10Giuseppe Lavagetto: [C: 031] "I explicitly verified name resolution is correct for all entries and is equivalent to our preceding configuration." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 (owner: 10Giuseppe Lavagetto) [12:18:32] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventbus,name=codfw [12:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:16] (03CR) 10Elukey: [C: 031] "LGTM, ran pcc also with some mcXXXX hosts https://puppet-compiler.wmflabs.org/5972 even if not necessary." 
[puppet] - 10https://gerrit.wikimedia.org/r/345529 (owner: 10Giuseppe Lavagetto) [12:23:00] (03CR) 10Alexandros Kosiaris: [C: 04-1] "nice, a few minor comments inline" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [12:23:26] akosiaris: ^ "a few" = 12 ... :P [12:24:41] reminds me volans :P [12:24:46] (03CR) 10Giuseppe Lavagetto: [C: 031] "I've personally looked at all hostname changes and verified everything should stay the same." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 (owner: 10Giuseppe Lavagetto) [12:25:26] lol [12:26:07] moritzm! Is it worth me pointing out again that grafana uses a public API and that means all of the data is public again? I find the pii reasoning odd (as did i when initially trying to get access to graphite some years ago) [12:28:03] (03CR) 10Giuseppe Lavagetto: [C: 031] "Uhm ok, I was actually debated about where to put system::role definitions, as some hosts will have one role that is clearly a composition" [puppet] - 10https://gerrit.wikimedia.org/r/345538 (owner: 10Alexandros Kosiaris) [12:28:59] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove system::role { 'conftool-master': } [puppet] - 10https://gerrit.wikimedia.org/r/345537 (owner: 10Alexandros Kosiaris) [12:30:04] gehel: well... most are a ditto :P [12:30:15] yeah, I saw... [12:33:14] (03CR) 10Alexandros Kosiaris: "Done in https://wikitech.wikimedia.org/w/index.php?title=Puppet_coding&diff=1754530&oldid=1352155" [puppet] - 10https://gerrit.wikimedia.org/r/345538 (owner: 10Alexandros Kosiaris) [12:34:47] hah [12:35:00] maybe system::profile makes more sense? 
[12:35:17] I get confused a lot because the english words profile and role seem backwards from how puppet actually uses them, but whatever [12:35:49] but if a profile is a singular function of a host, and a role gathers up a set of profiles that are deployed together on a type of multi-function host [12:36:04] and motd was describing the list of functions. it seems like it should suck up multiple system::profile for motd [12:39:03] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:48:15] (03PS5) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) [12:49:25] !log rebooting bast4001 for kernel update to 4.9 [12:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:14] (03CR) 10Gehel: postgresql - simplify creation of databases (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [12:50:40] jouncebot: next [12:50:41] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300) [12:50:59] (03CR) 10Giuseppe Lavagetto: [C: 031] "two minor comments but LGTM; we will be left with some cleanup to do afterwards, but that's not your current concern." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/333880 (owner: 10Elukey) [12:54:20] <_joe_> elukey: I'd merge that patch and the eventual followup after the EU SWAT [12:55:32] (03PS2) 10Hashar: [cleanup] Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: 10Urbanecm) [12:55:34] (03PS2) 10Hashar: Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: 10Urbanecm) [12:55:36] (03PS2) 10Hashar: Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: 10Urbanecm) [12:55:38] (03PS2) 10Hashar: Enable Multimedia Viewer at officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: 10Urbanecm) [12:55:39] zeljkof: I have rebased urbanecm patches [12:56:06] hashar, zeljkof: I'm ready for SWATing so if nothing has to be done before SWAT we can start I think [12:56:23] o/ [12:56:35] _joe_ do you mean mine or yours ? :) [12:56:43] <_joe_> yours [12:56:48] hashar: want to do the swat? should I? [12:56:51] <_joe_> and then possibly mine ones [12:56:53] ah okok! Updating the commit msg [12:56:54] zeljkof: can you ? [12:56:58] hashar: sure [12:57:00] got a meeting in half an hour [12:57:03] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:57:08] _joe_ I need to review the last two [12:57:11] but I can assist until that meeting start :] [12:57:18] * zeljkof is getting ready for swat [12:57:27] _joe_ but you can proceed if you want, you know my puppet level :) [12:57:33] I think we can deploy all four patches straight to mwdebug1001 and test there [12:57:40] hashar: it would be great if you could take a look at the patches and +1 them if you think they are fine [12:57:49] one is a throttle cleanup, the three others are each for separate wikis [12:57:51] <_joe_> elukey: I'm involved in SWAT, I have two patches to publish [12:57:52] I did [12:58:01] hashar: great, thanks [12:58:11] _joe_: you are deploying your own patches? [12:58:42] then there are a couple patches to change some hostnames by _joe_ and elukey [12:59:39] <_joe_> zeljkof: well I assumed I was going through the process [12:59:52] <_joe_> but I can merge and deploy them myself [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300). [13:00:04] Urbanecm and _joe_: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:16] _joe_: you can deploy yourself, either before or after the rest of the patches [13:00:24] (03PS14) 10Elukey: role::memcached: refactor in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/333880 [13:00:27] _joe_: you can deploy all patches, if you want to [13:00:50] but I can deploy the rest of the patches, if you prefer that [13:00:53] <_joe_> zeljkof: no thanks, I have other things to do too, go on with the rest of the patches, I'll be here after [13:01:32] _joe_: ok, starting with the swat, will ping you when I am done, so you can take over [13:01:38] I can SWAT today! [13:03:47] addshore: grafana supports more data sources (such as prometheus) and we can't rule out that this leaks PII data [13:03:54] (03PS1) 10BBlack: cxserver: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345542 [13:03:56] (03PS1) 10BBlack: citoid: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345543 [13:04:32] (03PS2) 10Gehel: Cirrus / Analytics - remove deprecated rsync job [puppet] - 10https://gerrit.wikimedia.org/r/345362 [13:06:15] Urbanecm: merging and deploying 344991 [13:06:19] zeljkof, ack [13:06:51] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: 10Urbanecm) [13:10:07] zeljkof, jenkins don't work? [13:10:19] Urbanecm: looks like jenkins is busy :( [13:10:28] one of the jobs is still in queue [13:10:49] zeljkof, maybe restart of the job would help but I'm not sure. [13:10:50] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (designing), 15User-mobrovac: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3143490 (10akosiaris) Sorry for not answering sooner on this. @mobrovac That's an arch... [13:10:52] (03CR) 10Jcrespo: [C: 031] "Not sure if you want it to do it separately. 
But remember it has to be separated from the hostname -> ip selection too here and on the cod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [13:11:35] (03CR) 10Marostegui: "> Not sure if you want it to do it separately. But remember it has to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [13:12:06] (03CR) 10Alexandros Kosiaris: [C: 031] "looks fine to me now (famous last words ;-))" [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [13:12:27] hashar: the first patch, and jenkins is stuck :( [13:13:04] operations-mw-config-composer-hhvm-jessiequeued jobs for 344991,2 is in queue for 5 minutes [13:13:36] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::common: extract scap::proxy to a profile [puppet] - 10https://gerrit.wikimedia.org/r/345527 [13:15:03] hashar: the only thing I see are a couple of core and wikibase jobs consuming a lot of instances [13:15:26] not sure what to do or how to speed things up [13:15:49] (03CR) 10Faidon Liambotis: [C: 04-1] "Also while you're touching this, fix sysctl.pp too:" [puppet] - 10https://gerrit.wikimedia.org/r/345366 (owner: 10Dzahn) [13:16:40] (03CR) 10Gehel: [C: 032] Cirrus / Analytics - remove deprecated rsync job [puppet] - 10https://gerrit.wikimedia.org/r/345362 (owner: 10Gehel) [13:16:50] (03PS1) 10Marostegui: site.pp: Remove db1057 [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) [13:16:51] Urbanecm, hashar the job is finally running [13:16:52] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::mediawiki::common: extract scap::proxy to a profile [puppet] - 10https://gerrit.wikimedia.org/r/345527 (owner: 10Giuseppe Lavagetto) [13:16:56] hashar: as FYI https://gerrit.wikimedia.org/r/#/c/333880/ - should be a noop [13:16:57] zeljkof, how do you watch the 
jenkins status? [13:16:59] (03Merged) 10jenkins-bot: [cleanup] Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: 10Urbanecm) [13:17:07] Urbanecm: https://integration.wikimedia.org/zuul/ [13:17:08] (03PS3) 10Giuseppe Lavagetto: role::mediawiki::common: extract scap::proxy to a profile [puppet] - 10https://gerrit.wikimedia.org/r/345527 [13:17:10] (03CR) 10jenkins-bot: [cleanup] Remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: 10Urbanecm) [13:17:11] zeljkof, thank you [13:17:25] zeljkof: looks good now? [13:17:28] 344991 merged, deploying [13:17:46] hashar: well, the commit got merged, only 10 minutes or so :/ [13:18:13] yeah the Wikibase patches are consuming too many instances [13:18:16] that is being refined [13:18:41] zeljkof: have you CR+2 the other ones? [13:19:07] hashar: not yet, was not sure if that will make things worse :/ [13:19:17] they get enqueued [13:19:28] and processed as instances are made available [13:19:38] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:344991|[cleanup] Remove expired rules (T161530)]] (duration: 00m 45s) [13:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:44] T161530: Remove throttle for ptwiki event - https://phabricator.wikimedia.org/T161530 [13:19:57] Urbanecm: 344991 deployed, nothing to check, right? [13:20:29] Urbanecm: the rest of the patches can be tested at mwdebug1002? (once deployed there) [13:20:31] zeljkof, yep. Everything should be caught by filters. [13:20:57] zeljkof, except 342798 as I have no access to officewiki. 
[13:21:11] (03PS6) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) [13:21:13] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/5973/ - seems a noop (changes are related to previous patches from what I can tell)" [puppet] - 10https://gerrit.wikimedia.org/r/345530 (owner: 10Giuseppe Lavagetto) [13:21:13] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:21:17] Urbanecm: ok, will ping you as commits are at mwdebug1002 [13:21:21] (03PS1) 10Faidon Liambotis: mediawiki: kill more precise references [puppet] - 10https://gerrit.wikimedia.org/r/345546 [13:21:23] zeljkof, okay [13:21:23] (03PS1) 10Faidon Liambotis: hhvm: kill a precise reference [puppet] - 10https://gerrit.wikimedia.org/r/345547 [13:21:25] (03PS1) 10Faidon Liambotis: apache: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345548 [13:21:27] (03PS1) 10Faidon Liambotis: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549 [13:21:29] (03PS1) 10Faidon Liambotis: aptrepo: remove precise-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/345550 [13:21:31] (03PS1) 10Faidon Liambotis: ganglia: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345551 [13:21:33] (03PS1) 10Faidon Liambotis: elasticsearch: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345552 [13:21:35] (03PS1) 10Faidon Liambotis: noc: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345553 [13:21:37] (03PS1) 10Faidon Liambotis: puppetmaster: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345554 [13:21:39] (03PS1) 10Faidon Liambotis: memcached: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345555 [13:21:41] (03PS1) 10Faidon Liambotis: openstack/nova: remove precise support [puppet] - 
10https://gerrit.wikimedia.org/r/345556 [13:21:43] (03PS1) 10Faidon Liambotis: ci: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345557 [13:21:45] (03PS1) 10Faidon Liambotis: lxc: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345558 [13:21:47] (03PS1) 10Faidon Liambotis: toollabs: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345559 [13:21:49] (03PS1) 10Ottomata: Attempt to run apt-get update before proceeding with installing cdh packages [puppet] - 10https://gerrit.wikimedia.org/r/345560 [13:21:54] (03PS2) 10Marostegui: site.pp,linux-host-entries.ttyS1: Remove db1057 [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435) [13:21:56] Seems we'll have some jenkins problems, this will create a lot of jobs... [13:22:01] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: 10Urbanecm) [13:22:40] and ops/puppet has priority... :/ [13:22:59] :( [13:23:10] (03CR) 10Elukey: "This one will probably need a broader audience. It looks good but we create a tight dependency with the discovery service in puppet, that " [puppet] - 10https://gerrit.wikimedia.org/r/345531 (owner: 10Giuseppe Lavagetto) [13:23:22] (03PS2) 10Ottomata: Attempt to run apt-get update before proceeding with installing cdh packages [puppet] - 10https://gerrit.wikimedia.org/r/345560 [13:23:27] (03CR) 10Ottomata: [V: 032 C: 032] Attempt to run apt-get update before proceeding with installing cdh packages [puppet] - 10https://gerrit.wikimedia.org/r/345560 (owner: 10Ottomata) [13:23:33] hashar: ok, now a lot of ops/puppet commits are landed... :( [13:23:34] (03CR) 10Muehlenhoff: "That's a duplicate of https://gerrit.wikimedia.org/r/#/c/345502/, but feel free to merge yours." 
[puppet] - 10https://gerrit.wikimedia.org/r/345546 (owner: 10Faidon Liambotis) [13:25:03] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:25:06] Urbanecm: I have no idea when the second commit will get merged... [13:25:16] I will ping you as soon as something happens... [13:25:20] but it might be a while [13:25:29] I have recommended to CR+2 all four patches [13:25:34] so you get CI processing them in the background [13:25:36] Okay, I have time :) [13:25:45] and by the time your first commit is deployed, the others got merged [13:25:48] and won't have to wait on it. [13:25:48] hashar: oh well, looks like I should have done that... [13:25:53] most probably you can just push [13:26:27] hashar: I will +2 them now, looks like there is nothing else to do [13:27:19] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: 10Urbanecm) [13:27:34] then you can fetch them on tin as needed and rebase one by one [13:27:38] <_joe_> hashar, zeljkof I would like to avoid auto-merging my patches, btw, so if you want to +2 them (if they don't seem wrong) [13:27:47] or just deploy all those four trivial patches to mwdebug to test them all [13:28:24] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: 10Urbanecm) [13:28:38] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::common: convert to profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/345528 [13:28:42] (03Merged) 10jenkins-bot: Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: 10Urbanecm) [13:28:47] _joe_: I am in a meeting though [13:29:05] _joe_: but looks like Elukey reviewed them and with the mwdebug deploy it should be fine [13:29:30] 
<_joe_> hashar: elukey just reviewed my puppet patches, not the ones up to deploy, but don't worry [13:29:33] _joe_: I can take a look, but I am not familiar with that code [13:29:45] (03CR) 10jenkins-bot: Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: 10Urbanecm) [13:29:50] <_joe_> zeljkof: kk, I'll auto-merge then :P [13:30:19] _joe_: that would actually make more sense, you know more about that than me :) [13:31:15] (03Merged) 10jenkins-bot: Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: 10Urbanecm) [13:31:31] (03CR) 10jenkins-bot: Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: 10Urbanecm) [13:31:54] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:31:54] (03Merged) 10jenkins-bot: Enable Multimedia Viewer at officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: 10Urbanecm) [13:32:56] (03CR) 10Gehel: [C: 031] "LGTM - thanks for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/345552 (owner: 10Faidon Liambotis) [13:33:12] (03CR) 10jenkins-bot: Enable Multimedia Viewer at officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: 10Urbanecm) [13:34:43] hashar, Urbanecm: looks like ops/puppet patches were quick, all swat patches are merged, deploying to mwdebug1002 [13:36:02] (03PS1) 10Faidon Liambotis: Standardize on lowercase os_version/require_os [puppet] - 10https://gerrit.wikimedia.org/r/345561 [13:36:19] zeljkof, ack [13:36:37] (03CR) 10Muehlenhoff: "Why is that needed? 
"apt-get update" is run with every puppet run anyway?" [puppet] - 10https://gerrit.wikimedia.org/r/345560 (owner: 10Ottomata) [13:37:24] Urbanecm: ok, all patches at mwdebug1002 [13:37:38] I will try to check officewiki one (342798) [13:37:50] * Urbanecm is going to test them [13:38:00] let me know when you have tested the other two [13:38:15] volans: yup new backend hosts and expired downtime :( [13:38:31] I figured, no worries [13:39:29] (03PS1) 10Ottomata: Add passwords::mysql::analytics_labsdb class to match prod [labs/private] - 10https://gerrit.wikimedia.org/r/345562 [13:39:46] (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::common: convert to profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/345528 (owner: 10Giuseppe Lavagetto) [13:39:52] (03CR) 10Ottomata: [V: 032 C: 032] Add passwords::mysql::analytics_labsdb class to match prod [labs/private] - 10https://gerrit.wikimedia.org/r/345562 (owner: 10Ottomata) [13:40:01] <_joe_> ahah I E [13:40:07] <_joe_> BEATED YOU OTTO [13:40:31] lol [13:40:51] zeljkof, working, please deploy them to the whole cluster [13:40:54] (03PS1) 10Faidon Liambotis: vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345563 [13:40:56] (03PS1) 10Faidon Liambotis: interface: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345564 [13:41:29] Urbanecm: will do, just a minute to check officewiki [13:42:34] multimediaviewer works on office wiki, deploying all patches [13:43:36] zeljkof, thank you [13:46:54] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:344989|Allow eliminators and autoreviewers to move a file on ptwiki (T161532)]] [[gerrit:345093|Assign move-categorypages to sysops&bots only on nlwiki (T161551)]] [[gerrit:342798|Enable Multimedia Viewer at officewiki (T160420)]] (duration: 00m 44s) [13:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] T160420: Install multimediaviewer on office wiki - 
https://phabricator.wikimedia.org/T160420 [13:47:02] T161551: Remove "move-categorypages" for normal users on nlwiki - https://phabricator.wikimedia.org/T161551 [13:47:02] T161532: Add permission "movefile" to two user groups (ptwiki) - https://phabricator.wikimedia.org/T161532 [13:47:03] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 647036 [13:47:15] Urbanecm: everything deployed, please check production, I will check officewiki [13:47:22] _joe_: swat is all yours [13:47:24] zeljkof, okay [13:47:26] <_joe_> zeljkof: thanks [13:47:43] _joe_: apologies for the delay, jenkins was busy [13:47:51] <_joe_> yeah not your fault [13:47:59] jouncebot, now [13:47:59] For the next 0 hour(s) and 12 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300) [13:48:16] (03PS2) 10Giuseppe Lavagetto: Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 [13:49:11] (03CR) 10Giuseppe Lavagetto: [C: 032] Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 (owner: 10Giuseppe Lavagetto) [13:49:27] zeljkof, working, thank you! [13:49:49] Urbanecm: I can not get multimedia viewer to work on office wiki :/ [13:50:17] zeljkof, did it work at mwdebug1002? [13:50:33] it was working fine on mwdebug1002 [13:50:41] But why? [13:50:58] that's what is confusing :( [13:51:13] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:51:17] you guys still Swatting? [13:51:25] yes [13:51:27] kaldari: _joe_ is [13:51:46] <_joe_> kaldari: yeah I'm merging my own patches, I'm not really a deployer and I have a ton of things to do after these patches [13:51:49] _joe_: Is it too late to add a config change to the swat? 
[13:51:57] (03Merged) 10jenkins-bot: Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 (owner: 10Giuseppe Lavagetto) [13:51:58] <_joe_> kaldari: ask zeljkof :) [13:52:07] (03CR) 10jenkins-bot: Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 (owner: 10Giuseppe Lavagetto) [13:52:08] zeljkof, is https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php cached or does it reflect current config? [13:52:08] Urbanecm: when I enabled the extension, selected mwdebug1002, clicked on a thumbnail, mvw opened [13:52:19] zeljkof, which is what we want. [13:52:38] kaldari: I am around. want to deploy it yourself, or should I? [13:53:15] Urbanecm: but now without the extension enabled, officewiki just opens an image in a separate page, not in MMV [13:53:19] zeljkof: I can do it myself if you want to take off [13:53:26] (03CR) 10Elukey: "Again really sorry for the delay! If you still have patience I left some comments for this code review, thanks for the time!" (032 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr) [13:53:46] kaldari: if _joe_ is done with his part of deploy, feel free to do your :) [13:54:06] kaldari: I am around, if you prefer that I do the deploy [13:54:22] zeljkof, is it possible to check if it is really enabled at prod? This is only one thing that came to my mind. [13:54:31] back around [13:54:37] Urbanecm: not sure [13:54:41] anything else left to do still? [13:54:51] hashar: can you check if multimedia viewer is enabled at office wiki? [13:54:55] hashar, we don't know why multimedia viewer work with mwdebug1002 and not at prod. 
[13:54:56] _joe_: Just let me know when you're done with yours [13:54:58] for multimedia viewer, if it does not work on office wiki it is not a problem [13:55:03] we can just report it is broken on the task [13:55:10] tgr said we can debug it / fix it later [13:55:13] just go to hope page and click an image, it should open in a MMV popup [13:55:18] hashar, okay, I write it there. [13:55:35] hashar: strange thing, it worked on mwdebug1002 :/ [13:55:49] are you sure you did a scap pull on both ? [13:56:07] hashar: nevermind, works now [13:56:13] maybe it needed a few minutes [13:56:16] zeljkof, hashar: So it is working or not? [13:56:24] Urbanecm: MMV works on office wiki [13:56:24] Should I write to the task anything special? [13:56:30] works for me on mwdebug1002 [13:56:35] zeljkof, okay, thank you for deploys and checking! [13:56:58] (03PS1) 10Faidon Liambotis: Switch add_ip6_mapped to use $::interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345568 [13:57:09] Urbanecm: thanks for the patch :] [13:57:10] (03PS7) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) [13:57:15] Urbanecm: no problem :) thanks for flying with #releng [13:57:17] You're welcome. [13:57:23] zeljkof: so yeah I guess you can sync the multimedia viewer change to the cluster [13:57:31] hashar: it already is [13:57:34] <_joe_> kaldari: I'm merging and testing at the same time, so it will take me some additional time tbh [13:57:35] \o/ [13:57:52] it worked fine on mwdebug, pushed to cluster, but could not verify for a few minutes [13:57:57] works now [13:58:02] _joe_: no worries, take your time [13:58:13] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 636695 [13:58:18] zeljkof: maybe because of resourceloader cache. 
I dont quite now it get invalidated [13:58:31] hashar: maybe [14:00:53] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:01:09] !log oblivian@tin Synchronized wmf-config/ProductionServices.php: switch to discovery for some records (duration: 00m 47s) [14:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:10] (03PS1) 10Faidon Liambotis: Replace $main_ipaddress by $::interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345569 [14:03:05] (03PS2) 10Giuseppe Lavagetto: Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 [14:03:13] (03CR) 10Giuseppe Lavagetto: [C: 032] Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 (owner: 10Giuseppe Lavagetto) [14:03:23] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 706171 [14:04:30] (03Merged) 10jenkins-bot: Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 (owner: 10Giuseppe Lavagetto) [14:04:43] (03CR) 10jenkins-bot: Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 (owner: 10Giuseppe Lavagetto) [14:04:54] (03CR) 10Gehel: [C: 032] postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [14:06:09] <_joe_> kaldari: I'm almost done, need to watch logs for a few to be sure nothing is exploding tho [14:06:24] !log oblivian@tin Synchronized wmf-config/ProductionServices.php: switch to discovery for cxserver,eventbus (duration: 00m 43s) [14:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:23] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:08:57] (03PS1) 10Gehel: Revert "postgresql - simplify creation of databases" [puppet] - 10https://gerrit.wikimedia.org/r/345572 [14:08:58] <_joe_> kaldari, zeljkof I'm done [14:09:07] (03CR) 10Gehel: [V: 032 C: 032] Revert "postgresql - simplify creation of databases" [puppet] - 10https://gerrit.wikimedia.org/r/345572 (owner: 10Gehel) [14:09:23] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:09:53] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:10:29] kaldari: if no other deployment is happening, take over swat [14:10:31] _joe_: Thanks, you have anything else zeljkof or should I go now [14:10:32] ? [14:10:35] Cool [14:10:44] kaldari: I am done [14:11:57] (03PS1) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345573 (https://phabricator.wikimedia.org/T157613) [14:13:00] (03CR) 10Kaldari: [C: 032] Enable cookie blocking on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324672 (https://phabricator.wikimedia.org/T152076) (owner: 10Kaldari) [14:13:10] (03PS3) 10Kaldari: Enable cookie blocking on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324672 (https://phabricator.wikimedia.org/T152076) [14:15:47] (03CR) 10R4q3NWnUx2CEhVyr: "The code flow is as follows:" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr) [14:18:43] (03CR) 10jenkins-bot: Enable cookie blocking on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324672 (https://phabricator.wikimedia.org/T152076) (owner: 10Kaldari) [14:20:59] is swat done? [14:21:06] marostegui: almost [14:21:09] oki! 
[14:21:48] !log sync InitialiseSettings.php to enable cookie blocking on English Wikipedia [14:21:53] (03PS2) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345573 (https://phabricator.wikimedia.org/T157613) [14:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:13] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574 [14:22:16] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574 [14:22:44] !log kaldari@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 48s) [14:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:28] marostegui: SWAT is done! [14:27:38] \o/ [14:27:39] thanks! [14:28:02] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574 (owner: 10Marostegui) [14:28:04] (03PS1) 10Ottomata: Remove unused statistics::rsync::webrequest class from statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/345578 [14:29:11] (03CR) 10Ottomata: [V: 032 C: 032] Remove unused statistics::rsync::webrequest class from statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/345578 (owner: 10Ottomata) [14:29:17] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574 (owner: 10Marostegui) [14:29:29] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574 (owner: 10Marostegui) [14:30:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1090 - T17441 (duration: 00m 45s) [14:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:18] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data 
- https://phabricator.wikimedia.org/T17441 [14:30:31] (03PS2) 10Marostegui: db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) [14:31:33] PROBLEM - MariaDB Slave Lag: x1 on db2033 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 446.87 seconds [14:32:30] ^ checking [14:33:11] !log rebooting restbase2001 to Linux 4.9 [14:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:48] !log upgrading nova-compute to 12.0.6 on all labvirts [14:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:51] PROBLEM - swift-object-replicator on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:36:10] PROBLEM - swift-object-server on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:36:50] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:37:00] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 5 minutes ago with 5 failures. Failed resources (up to 3 shown): Service[swift-account-replicator],Service[swift-account-reaper],Service[swift-account-auditor],Service[swift-object] [14:37:10] still me ^ silencing [14:37:19] (03CR) 10Elukey: "Completely right, the Gerrit preview fooled me! Thanks for the explanation, will test your code today/tomorrow!" 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr) [14:37:30] PROBLEM - swift-account-auditor on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:37:40] PROBLEM - swift-account-reaper on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:38:00] PROBLEM - swift-account-replicator on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:38:29] !log run stress test (w/ bonnie) on new swift hw - T160640 [14:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:36] T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640 [14:39:11] (03CR) 10Gehel: [C: 032] postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345573 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [14:39:18] (03PS3) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345573 (https://phabricator.wikimedia.org/T157613) [14:40:41] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3143658 (10Papaul) 1st crash Date: October 24, 2016 Troubleshooting : removed both PSU's for a couple of minutes 2nd crash Date Dcember 12... [14:41:06] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.37 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345283 (https://phabricator.wikimedia.org/T151553) (owner: 10Gilles) [14:43:40] RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 0.07 seconds [14:45:50] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:46:50] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:48:11] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 314 [14:48:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [14:48:50] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:49:32] (03Merged) 10jenkins-bot: db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [14:49:41] (03CR) 10jenkins-bot: db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [14:49:46] puppet fail on maps* is me, fix coming up [14:50:57] (03PS1) 10Gehel: postgresql - package management has moved to postgresql::server class [puppet] - 10https://gerrit.wikimedia.org/r/345580 [14:50:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1057 entry from s1 shard - T160435 (duration: 00m 44s) [14:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:05] T160435: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435 [14:54:16] (03CR) 10Gehel: [C: 032] postgresql - package management has moved to postgresql::server class [puppet] - 10https://gerrit.wikimedia.org/r/345580 (owner: 10Gehel) [14:54:40] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:54:51] (03PS1) 10Faidon Liambotis: motd: remove precise, add comments for stretch [puppet] - 10https://gerrit.wikimedia.org/r/345581 [14:55:40] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:50] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:59:59] (03CR) 10Hashar: "Standardization!! I found a few more using the perl regex:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345561 (owner: 10Faidon Liambotis) [15:00:50] RECOVERY - swift-object-replicator on ms-be1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:01:00] RECOVERY - swift-account-replicator on ms-be1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [15:01:00] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:01:10] RECOVERY - swift-object-server on ms-be1031 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:01:30] RECOVERY - swift-account-auditor on ms-be1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:01:40] RECOVERY - swift-account-reaper on ms-be1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:05:33] (03PS2) 10Faidon Liambotis: Standardize on lowercase os_version/require_os [puppet] - 10https://gerrit.wikimedia.org/r/345561 [15:06:11] (03CR) 10Faidon Liambotis: "Thanks Antoine, a couple of more fixed. 
There are more that surface with that grep, but they are being fixed by the precise-remnants topic" [puppet] - 10https://gerrit.wikimedia.org/r/345561 (owner: 10Faidon Liambotis) [15:12:00] (03Draft1) 10Paladox: Gerrit: Add log4j.logger.org.apache.sshd.common.keyprovider.FileKeyPairProvider=INFO to log4j [puppet] - 10https://gerrit.wikimedia.org/r/345583 [15:12:03] (03PS2) 10Paladox: Gerrit: Add log4j.logger.org.apache.sshd.common.keyprovider.FileKeyPairProvider=INFO to log4j [puppet] - 10https://gerrit.wikimedia.org/r/345583 [15:13:30] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:14:29] 06Operations, 06Analytics-Kanban, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3143721 (10Nuria) a:03Ottomata [15:15:28] 06Operations, 06Analytics-Kanban, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (10Nuria) Let's take advantage of the fact that after the rename we have now autoincrement ids on new tables . [15:15:50] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:16:00] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:16:10] (03CR) 10Faidon Liambotis: [C: 032] motd: remove precise, add comments for stretch [puppet] - 10https://gerrit.wikimedia.org/r/345581 (owner: 10Faidon Liambotis) [15:16:29] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::common: convert to profiles (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/345529 [15:17:50] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:18:28] (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::common: convert to profiles (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/345529 (owner: 10Giuseppe Lavagetto) [15:18:45] (03CR) 10Ottomata: [C: 031] Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) (owner: 10Elukey) [15:19:16] <_joe_> paravoid: can I merge your change? [15:19:51] yes please [15:21:36] _joe_: ^ [15:21:46] <_joe_> paravoid: already done [15:21:56] do we have a job these days to trigger the puppet compiler and report back if a change is noop? [15:22:02] <_joe_> sorry, I was concentrated on monitoring my change [15:22:18] <_joe_> paravoid: nope [15:22:24] :( [15:22:34] <_joe_> paravoid: it's pretty easy to write using cumin libraries to query puppetdb, too [15:22:40] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:22:49] <_joe_> to create a host list, I mean [15:23:00] a list of affected hosts you mean? 
[15:23:03] <_joe_> yes [15:23:36] paravoid: also if the change is only in installed packages might not be reported by the puppet compiler [15:23:57] <_joe_> so the idea CI job would: extract name of changed classes or defines from the changed files [15:24:00] I was thinking of stuff like https://gerrit.wikimedia.org/r/345561 [15:24:35] brb [15:24:40] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:25:02] that probably is worth a puppet compiler run without specifying the hosts, will take time though [15:27:04] PROBLEM - nova-conductor process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor [15:27:10] PROBLEM - DPKG on labcontrol1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:27:20] PROBLEM - nova-scheduler process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler [15:27:29] chasemp ^^ [15:27:34] that's paging too [15:27:53] andrewbogott: moritzm are you guys doing labvirt maint still? [15:28:04] yeah, those alerts are me [15:28:10] RECOVERY - nova-conductor process on labcontrol1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor [15:28:25] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:28:25] RECOVERY - nova-scheduler process on labcontrol1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-scheduler [15:28:31] shoudl be fixed [15:28:33] bblack: sorry, upgrades in progress I believe and forgot to silence [15:28:59] Yeah, it shouldn't have stopped any services, something busted in the .deb I guess :( [15:29:15] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: move role to profile, unify videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/345530 [15:29:15] RECOVERY - DPKG on labcontrol1001 is OK: All packages OK [15:29:25] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:29:28] the same process worked without alerts on labtest [15:30:10] <_joe_> andrewbogott: or, when doing it on production and not test you have traffic, load, etc interfering [15:30:13] <_joe_> :) [15:30:59] It was apt-get fussing about a config file — it removed the file and then got upset that it wasn't there. I touched it and then all was well. [15:32:35] (03PS1) 10Tobias Gritschacher: Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 [15:33:45] PROBLEM - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:33:55] PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused [15:34:25] PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:35:25] RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational [15:35:45] RECOVERY - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-c valid until 2017-11-17 00:54:27 +0000 (expires in 231 days) [15:35:55] RECOVERY - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.188 port 9042 [15:38:08] (03CR) 10GoranSMilovanovic: [C: 031] Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 (owner: 10Tobias Gritschacher) [15:39:18] (03PS1) 10Andrew Bogott: Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 [15:39:27] (03CR) 10Elukey: "Other comments if you have time :)" (035 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr) [15:40:21] (03CR) 10jerkins-bot: [V: 04-1] Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott) [15:41:35] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:41:54] urandom: restbase2010 ? 
[15:42:11] (03PS2) 10BBlack: cxserver: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345542 [15:42:24] elukey: looking [15:42:32] (03PS2) 10BBlack: citoid: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345543 [15:42:45] elukey: probably an oom :o [15:42:51] (03PS1) 10BBlack: maps: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345591 [15:43:08] (03PS1) 10BBlack: swift/upload: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345592 [15:44:05] subbu: yessss [15:44:11] errr urandom [15:44:17] sorry :) [15:44:41] elukey: yeah, two in a row; so if it keeps doing it we can try dialing down the tombstone threshold until it passes [15:45:05] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:45:55] PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused [15:46:05] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:46:45] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 48652.559023 Seconds [15:46:55] RECOVERY - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.188 port 9042 [15:47:05] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [15:47:15] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused [15:47:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 48709.626435 Seconds [15:47:36] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 48709.629132 Seconds [15:47:45] PROBLEM - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:47:58] ugh [15:48:45] PROBLEM - cassandra-a service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:48:45] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:49:45] RECOVERY - cassandra-a service on restbase2004 is OK: OK - cassandra-a is active [15:49:45] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [15:49:45] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [15:50:35] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [15:50:35] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [15:50:45] RECOVERY - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-a valid until 2017-09-12 15:35:23 +0000 (expires in 165 days) [15:51:15] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on 10.192.32.137 port 9042 [15:51:18] urandom: need help? [15:51:31] volans: with this, or in general? :) [15:51:51] volans: this is a known issue [15:51:52] lol :) with restbase complainin... [15:52:11] and it's happening in codfw which means it won't impact client reads [15:52:14] small favors [15:52:47] some update is tripping over a aberrant data [15:53:20] ok, although not so ok :) [15:53:45] PROBLEM - cassandra-c service on restbase2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [15:53:45] PROBLEM - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:53:55] PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused [15:54:25] PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:54:40] 06Operations, 10Analytics, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3115695 (10Nuria) Can we be specific as to what needs improvements to help ops document what is needed? cc @Zareenf, @Tbayer @mpopov who had had trouble with thi... [15:55:45] RECOVERY - cassandra-c service on restbase2010 is OK: OK - cassandra-c is active [15:56:25] RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational [15:56:45] RECOVERY - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-c valid until 2017-11-17 00:54:27 +0000 (expires in 231 days) [15:56:55] RECOVERY - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is OK: TCP OK - 0.038 second response time on 10.192.16.188 port 9042 [15:56:56] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3143852 (10Nuria) [15:57:17] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3136552 (10Nuria) a:03elukey [15:58:40] (03PS1) 10Gilles: Follow-up fixes for 0.1.37 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345597 [15:58:50] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3143859 (10fgiunchedi) a:05Cmjohnson>03fgiunchedi [15:59:15] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.17/includes/: sync I7c5c0a9b1af99ce2b5f4bdcc99710d8400ca8bcf refs T159319 (duration: 01m 41s) [15:59:19] (03CR) 10Filippo Giunchedi: [C: 032] Follow-up fixes for 0.1.37 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345597 (owner: 10Gilles) [15:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog, moritzm, and _joe_: Dear 
anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1600). [16:00:14] (03PS1) 10Gilles: Remove trailing slash from SWIFT_API_PATH in Thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/345598 [16:01:12] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: CODFW: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161637#3138006 (10Nuria) p:05Triage>03Normal [16:01:46] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:05] PROBLEM - Nginx local proxy to apache on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:05] PROBLEM - HHVM rendering on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:21] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: move role to profile, unify videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/345530 [16:03:09] <_joe_> !log restarting hhvm on mw1191, stuck in HPHP::Treadmill::getAgeOldestRequest [16:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:55] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner: move role to profile, unify videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/345530 (owner: 10Giuseppe Lavagetto) [16:06:55] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:07:07] (03PS2) 10Filippo Giunchedi: Remove trailing slash from SWIFT_API_PATH in Thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/345598 (owner: 10Gilles) [16:07:55] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational [16:11:06] (03CR) 10Filippo Giunchedi: [C: 032] Remove trailing slash from SWIFT_API_PATH in Thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/345598 (owner: 10Gilles) [16:17:21] !log upgrade thumbor to 0.1.37 on thumbor100[12] [16:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:39] (03PS2) 10Filippo Giunchedi: rancid: create 'configs' directory [puppet] - 10https://gerrit.wikimedia.org/r/345369 [16:19:15] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] rancid: create 'configs' directory [puppet] - 10https://gerrit.wikimedia.org/r/345369 (owner: 10Filippo Giunchedi) [16:20:06] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3143911 (10fgiunchedi) [16:20:08] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3143909 (10fgiunchedi) 05Open>03Resolved This is completed! [16:20:32] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3143912 (10Gilles) 05Open>03Resolved I'll actually enable them when there's ELK integration. Right now added to the existing log entries, it would be too verbose. [16:28:57] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott) [16:30:08] (03CR) 10jerkins-bot: [V: 04-1] Include nova's policy.json along with all services. 
[puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott) [16:31:50] 06Operations, 10hardware-requests: codfw: (1) netmon system - https://phabricator.wikimedia.org/T161807#3143960 (10fgiunchedi) [16:33:19] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:39] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [16:34:00] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345603 (https://phabricator.wikimedia.org/T128546) [16:34:16] (03PS2) 10Gilles: Make Thumbor connect to Swift via https [puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) [16:37:29] 06Operations, 10Analytics, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3144005 (10Tbayer) >>! In T160941#3143836, @Nuria wrote: > Can we be specific as to what needs improvements to help ops document what is needed? cc @Zareenf, @Tb... 
[16:37:36] (03CR) 10Filippo Giunchedi: [C: 032] Make Thumbor connect to Swift via https [puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: 10Gilles) [16:39:02] 06Operations, 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3144009 (10Nuria) [16:39:26] (03PS2) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338 [16:39:39] (03PS2) 10Filippo Giunchedi: Use proper proxy_next_upstream configuration for Thumbor's nginx [puppet] - 10https://gerrit.wikimedia.org/r/345315 (https://phabricator.wikimedia.org/T161613) (owner: 10Gilles) [16:40:24] (03CR) 10Gehel: [C: 04-1] "on hold until we check things with the maps team" [puppet] - 10https://gerrit.wikimedia.org/r/345591 (owner: 10BBlack) [16:41:11] (03PS2) 10Andrew Bogott: Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 [16:41:32] (03PS3) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338 [16:41:51] (03CR) 10Filippo Giunchedi: [C: 032] Use proper proxy_next_upstream configuration for Thumbor's nginx [puppet] - 10https://gerrit.wikimedia.org/r/345315 (https://phabricator.wikimedia.org/T161613) (owner: 10Gilles) [16:43:12] (03CR) 10jerkins-bot: [V: 04-1] Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott) [16:43:19] (03PS3) 10Andrew Bogott: Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 [16:45:24] (03PS1) 10Gilles: Follow-up fix for hasattr on early error (0.1.37) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345608 [16:45:59] (03CR) 10jerkins-bot: [V: 04-1] Include nova's policy.json along with all services. 
[puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott) [16:46:19] (03PS4) 10Andrew Bogott: Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 [16:47:30] (03PS4) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338 [16:47:34] 06Operations, 10Analytics, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3144130 (10Nuria) This task will be seen by the ops on cleaning duty next week. It will help to have a list of issues so they can know what problems the documenta... [16:48:32] (03CR) 10Andrew Bogott: [C: 032] Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott) [16:49:20] (03PS5) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338 [16:50:02] 06Operations, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3144138 (10Nuria) [16:50:58] (03CR) 10Gehel: [C: 032] postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338 (owner: 10Gehel) [16:51:29] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [16:51:49] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:52:24] godog: expected? [16:52:29] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [16:53:40] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:53:59] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! 
[16:54:08] mhh no gilles, checking [16:54:31] the retries working properly now probably increases the workload [16:54:39] (03Abandoned) 10Rush: WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) (owner: 10Rush) [16:55:29] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [16:59:18] 06Operations, 10Traffic: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144203 (10EBernhardson) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1700). [17:00:37] 06Operations, 10Traffic: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144218 (10EBernhardson) [17:00:46] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 210 bytes in 0.883 second response time [17:00:54] no parsoid deploy today [17:01:06] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [17:01:06] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:01:16] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:01:59] 06Operations, 10Traffic: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144203 (10EBernhardson) [17:03:11] 06Operations, 10Traffic, 10Wikimedia-Logstash: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144248 (10EBernhardson) [17:04:06] PROBLEM - PyBal backends health check on lvs1006 is 
CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [17:05:06] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:05:56] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:28] still looking into thumbor/nginx, will downtime that ^ [17:07:06] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 41529 [17:08:06] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [17:09:06] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:10:41] (03PS10) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [17:10:55] (03CR) 10Gehel: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [17:13:56] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! [17:15:46] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 210 bytes in 0.002 second response time [17:15:57] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [17:18:57] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down! 
[17:19:36] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:46] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:56] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:26] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:20:56] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [17:21:51] (03PS1) 10Gilles: Revert "Make Thumbor connect to Swift via https" [puppet] - 10https://gerrit.wikimedia.org/r/345615 [17:22:19] (03PS2) 10Filippo Giunchedi: Revert "Make Thumbor connect to Swift via https" [puppet] - 10https://gerrit.wikimedia.org/r/345615 (owner: 10Gilles) [17:22:26] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Revert "Make Thumbor connect to Swift via https" [puppet] - 10https://gerrit.wikimedia.org/r/345615 (owner: 10Gilles) [17:24:26] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational [17:24:30] 06Operations, 06Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3144389 (10matmarex) I think that's a different uselang hack. [17:28:38] cmjohnson1: hi! [17:29:06] hi [17:29:21] was it today that we scheduled the test for thermal paste? [17:29:37] yes it was....i forgot to put on my cal...do you have time now? [17:29:45] sure if you have it! [17:29:50] yes [17:29:54] I'll turn off analytics1039 [17:32:34] 06Operations, 06Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3144415 (10Steinsplitter) >>! In T161517#3144389, @matmarex wrote: > I think that's a different uselang hack. Yepp... 
[17:32:55] !log shutdown analytics1039 to apply new thermal paste - T132256 [17:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:01] T132256: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256 [17:33:34] cmjohnson1: it should drain in 5 mins, will ping you when ready! [17:38:08] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3144442 (10MoritzMuehlenhoff) I suspected this might crash due to our setting of kernel.perf_event_paranoid=3 and some kind of resource leak due to failing... [17:41:36] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:41:56] PROBLEM - HHVM processes on mw1261 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [17:43:26] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.092 second response time [17:43:36] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational [17:43:36] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 73923 bytes in 4.070 second response time [17:43:46] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.242 second response time [17:43:47] (03PS1) 1020after4: Phab: User base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) [17:43:56] RECOVERY - HHVM processes on mw1261 is OK: PROCS OK: 6 processes with command name hhvm [17:45:02] cmjohnson1: analytics1039 just shutdown :) [17:45:09] yep [17:45:42] (03PS1) 1020after4: Phab: create some task types and corresponding custom fields. 
[puppet] - 10https://gerrit.wikimedia.org/r/345618 [17:50:54] (03PS2) 10Dduvall: ci: Docker registry for container builds [puppet] - 10https://gerrit.wikimedia.org/r/345422 (https://phabricator.wikimedia.org/T161657) [17:56:36] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 297699 [17:57:00] elukey: powering up now [17:57:23] nice! [17:58:07] I'll monitor the mcelog during the next days [17:58:58] sounds good [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1800). Please do the needful. [18:00:04] jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:14] (03PS1) 10Ladsgroup: Enable ORES review tool in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621) [18:00:48] cmjohnson1: thanks a lot! [18:01:05] yw...sorry i forgot this mornign [18:01:39] no problem :) [18:01:55] If there is time, I added one thing to swat [18:02:33] guess I can swat [18:03:16] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3144546 (10elukey) Chris applied the thermal paste and the host is up and running again. Will watch mcelog during the next days to see if... [18:05:13] jan_drewniak, your instructions are wrong. 
sync-portals should be the thing to deploy with, not run after deploy [18:06:16] (03CR) 10MaxSem: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345603 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:07:29] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345603 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:07:38] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345603 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:08:56] PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:09:32] jan_drewniak, pulled on mwdebug1002 [18:12:26] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:13:48] jan_drewniak, how does it look? 
[18:14:54] MaxSem: one sec [18:15:53] MaxSem: yup, looks good [18:16:58] (03PS2) 1020after4: Phab: Use base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) [18:17:34] (03PS3) 1020after4: Phab: Use base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765) [18:17:38] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 45s) [18:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:22] !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 44s) [18:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:39] jan_drewniak, deployed, please test [18:18:52] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3144577 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1046.eqiad.wmnet'] ``` The log can b... [18:21:40] MaxSem: mwdebug1002 and prod are still different... 
last time we had this issue it was because the date-modified timestamp on the portals repo didn't changed after it was updated, (and thus the files weren't copied over) [18:22:28] the solution was to `touch portals`and rerun the script [18:23:44] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 44s) [18:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:28] !log swift eqiad-prod add ms-be1028 -> ms-be1039 - T160640 [18:24:28] !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 44s) [18:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:34] T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640 [18:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:40] jan_drewniak, ^ [18:27:05] MaxSem: oy, still different... [18:28:21] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 44s) [18:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:07] !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 45s) [18:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:38] jan_drewniak, ^ [18:32:48] MaxSem: can you manually compare the files on staging and production? I'm still seeing the old version in production :/ [18:33:43] maybe it's varnish cache? [18:34:10] (a wild idea, I have no idea how portals work) [18:34:16] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [18:35:05] Amir1: I have no idea how varnish works :P [18:35:33] fuck, appservers have older files... 
[18:35:45] https://wikitech.wikimedia.org/wiki/Varnish [18:35:51] This might help :) [18:36:34] despite even everythng having been touched [18:36:57] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:38:06] !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 44s) [18:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:53] anyone knows what's going on with scap? [18:39:02] bd808, RainbowSprinkles ^? [18:40:26] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:40:49] MaxSem: I've been on vacation for the past week, so no [18:41:06] woot, congrats on vacation! [18:43:23] MaxSem: sync-dir was recently changed to sync-file in the script, but it seemed like that should improve things https://github.com/wikimedia/portals/commit/ad6eaef5090ef7f706989123a5fcfe6ea22fcb4c [18:44:43] trying with sync-dir [18:45:06] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 43s) [18:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:50] !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 44s) [18:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:21] dunno, scap is broken? [18:47:49] jan_drewniak, I'm going to revert, unless there's another idea [18:48:15] MaxSem: sync-dir and sync-file are the same thing fwiw [18:48:22] sync-dir is a hidden back-compat alias now [18:48:49] did something break in process? :} [18:49:07] That code has been live for like a month now, so doubt it [18:50:11] dunno, reverting and filing bug [18:51:49] Several days ago when I wanted to deploy Wikidata extension, sync-dir didn't do anything when I tried to syn-dir the whole extension (there were lots of files touched). 
but when I did sync-dir and sync-file on smaller directories it worked [18:51:50] HTH [18:52:16] MaxSem: Yeah, probably deserves a phab task, because this happened last time (touch fixed it then) [18:52:44] grrr [18:54:55] !log maxsem@tin Synchronized portals/: (no justification provided) (duration: 00m 48s) [18:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:31] !log Portals were not deployed: https://phabricator.wikimedia.org/T161832 [18:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1900). Please do the needful. [19:01:09] so I think my patch in SWAT is not done :( [19:01:33] :( [19:02:03] I will take it to the next window [19:05:31] This image acts strange: https://commons.wikimedia.org/wiki/File:Fawiki500k_celebration_by_Behdad_Abedi_(180).jpg [19:05:59] it doesn't show anything and clicking on "original file" gives 404 [19:06:00] https://upload.wikimedia.org/wikipedia/commons/e/e6/Fawiki500k_celebration_by_Behdad_Abedi_%28180%29.jpg [19:06:15] I don't know where should I talk about it though [19:25:05] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.18 [19:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:33] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [19:26:33] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:30:13] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 385 bytes in 0.010 second response time [19:31:13] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the 
critical threshold [1000.0] [19:32:13] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [19:39:13] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.007 second response time [21:02:44] (03PS1) 10Niharika29: Test LoginNotify on Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345726 (https://phabricator.wikimedia.org/T158878) [21:05:18] (03CR) 10Paladox: "@demon / @RainbowSprinkles / @Chad / @Dzahn This may not be needed after all :)" [puppet] - 10https://gerrit.wikimedia.org/r/343736 (owner: 10Paladox) [21:05:51] paladox: Ugh please do not ping me 3 different ways like that. I get e-mail notifications. [21:06:00] oh sorry [21:07:49] !log demon@tin Synchronized php-1.29.0-wmf.18/extensions/Echo/includes/model/Event.php: fix logging class reference (duration: 00m 47s) [21:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:50] (03PS1) 10Chad: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345745 [21:16:04] 06Operations, 10Traffic, 10Wikimedia-Logstash: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3145357 (10BBlack) Ok, I was wrong in my initial thinking. Even though we configure `proxy_buffering off;`, `proxy_buffer_size` is still a factor. Technicall... 
[21:17:15] (03PS1) 10Jdlrobson: Reflect change in purpose of RelatedArticlesFooterBlacklistedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345758 (https://phabricator.wikimedia.org/T160076) [21:18:05] 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: Update npm to 3 or 4 - https://phabricator.wikimedia.org/T155488#3145376 (10Krinkle) [21:20:44] (03PS1) 10BBlack: tlsproxy: double the response buffer size [puppet] - 10https://gerrit.wikimedia.org/r/345767 (https://phabricator.wikimedia.org/T161819) [21:25:13] (03CR) 10Chad: [C: 032] group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345745 (owner: 10Chad) [21:26:19] (03Merged) 10jenkins-bot: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345745 (owner: 10Chad) [21:27:44] (03CR) 10jenkins-bot: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345745 (owner: 10Chad) [21:34:04] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 -> wmf.18 [21:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:19] (03CR) 10Dzahn: [C: 031] Remove a number of obsolete conditionals from mediawiki classes [puppet] - 10https://gerrit.wikimedia.org/r/345502 (owner: 10Muehlenhoff) [22:03:15] 06Operations, 10media-storage: 404 error while accessing some djvu files - https://phabricator.wikimedia.org/T161864#3145437 (10Paladox) [22:03:18] 06Operations, 10media-storage: 404 error while accessing some djvu files - https://phabricator.wikimedia.org/T161864#3145480 (10Wieralee) And another one: https://pl.wikisource.org/wiki/Strona:Wykolejony_(Gruszecki)_24.jpg [22:04:17] 06Operations, 10media-storage: 404 error while accessing some djvu files - https://phabricator.wikimedia.org/T161864#3145482 (10Paladox) Another user reported this on irc [20:05:31] This image acts strange: https://commons.wikimedia.org/wiki/File:Fawiki500k_celebration_by_Behdad_Abedi_(180).jpg 
[20:0... [22:05:19] 06Operations, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161864#3145483 (10Paladox) [22:05:33] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [22:06:33] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [22:19:44] (03PS2) 10Dzahn: admin: Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 (owner: 10Tobias Gritschacher) [22:20:16] (03PS3) 10Dzahn: admin: Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 (owner: 10Tobias Gritschacher) [22:21:32] 06Operations, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161864#3145501 (10Wieralee) Another file has disappeared: https://commons.wikimedia.org/wiki/File:PL_J%C3%B3zef_Ignacy_Kraszewski-Poezye_tom_2.djvu [22:23:47] (03CR) 10Dzahn: [C: 032] admin: Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 (owner: 10Tobias Gritschacher) [22:25:03] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:27:20] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3145503 (10Dzahn) a:05Dzahn>03Fjalapeno Hi, @Fjalapeno we'll need your approval for this access request for Joe Walsh please. Could you add it and assign the ticket... 
[22:27:59] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3145507 (10Dzahn) a:05Dzahn>03pmiazga [22:28:03] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [22:37:42] (03PS2) 10Dzahn: base: sysctl/check_puppetrun: remove precise remnants [puppet] - 10https://gerrit.wikimedia.org/r/345366 [22:38:21] There are some recent reports about 404 images on Commons in https://phabricator.wikimedia.org/T161864 - is that known? [22:42:27] andre__: i think it's known that it happens "sometimes" [22:43:18] mutante: Yeah I was wondering if it's maybe "more recently more popular" as there are several reports (and someone set UBN priority and I wonder if that's justified) [22:43:56] 06Operations, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161864#3145530 (10saper) Same as T161864 reported earlier by @Amire80 [22:46:15] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3145534 (10saper) p:05Triage>03Unbreak! [22:46:46] I'm not sure it's justified as UBN, but w/e [22:47:22] * RainbowSprinkles sips a drink from his armchair [22:47:37] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3145541 (10saper) [22:47:40] 06Operations, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161864#3145543 (10saper) [22:47:59] andre__: hmm, i tried to find something, i found just the tracking bug and we had this recently https://phabricator.wikimedia.org/T161360 but that was different all thumbs [22:48:24] mutante, Hmmm.... 
Okay thanks for checking :-/ [22:48:26] https://phabricator.wikimedia.org/T43371 [22:48:32] tracking thing.. afraid that's what i have right now [22:52:05] it's probably "High" but not quite "UBN" [22:54:55] missing files totally stopped work on plwikisource [22:56:33] Well, the bug makes it sound like some files are missing, not all [22:56:47] If *all* files are missing, that's UBN [22:57:41] when we can expect the files being restored to be able to continue work? [22:59:42] hi [23:00:01] i would like to add something to swat, but will take me a few minutes to prepare [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T2300). Please do the needful. [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:08] (can deploy myself, if easier) [23:00:27] I'm here [23:00:48] aude: OK go ahead and do that after I'm done? [23:00:50] ankry: Well, that all depends on what's broken with it. Nobody knows yet :) [23:00:55] So, could be 5 minutes, could be 5 years! [23:00:57] :D [23:01:29] sad to sey people they should stop work for 5 years [23:01:44] they will definitely leave the project [23:01:56] anything in some logs? [23:02:01] ankry: My point is: it's impossible to give an ETA until someone debugs it. [23:02:23] Also: is it affecting all djvu files, or just some? The bug makes it sound like it's just *some* [23:02:32] RoanKattouw: ok [23:02:49] always takes some time to update teh wikidata build and wait for jenkins [23:02:53] those that were worked on [23:02:53] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:02:54] saper: Nothing standing out, but we've had some weird un-parseable djvu errors I've seen the last few weeks/months [23:03:12] RainbowSprinkles: those that needed thumbnail generation [23:03:14] Basically error message saying "" and nothing else [23:03:14] and cherry pick [23:03:27] also some jpg [23:07:49] !log catrope@tin Synchronized php-1.29.0-wmf.18/extensions/ORES/modules/: T161706 (duration: 00m 51s) [23:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:59] T161706: Review ORES prediction visibility on wikis where they are enabled by default - https://phabricator.wikimedia.org/T161706 [23:08:17] OK, I'm done [23:08:54] ok [23:09:17] waiting for jenkins... [23:12:36] Operations, Multimedia, media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3145574 (Ankry) Files from T161864 that disappeared: https://commons.wikimedia.org/wiki/File:Andrzej_Kijowski_-_Listopadowy_wiecz%C3%B3r.djvu https://com... [23:30:53] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:34:31] updating the wikidata build now [23:38:41] (PS2) Gilles: Follow-up fix for hasattr on early error (0.1.37) [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/345608 [23:39:23] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:39:38] (PS3) Gilles: Upgrade to 0.1.38 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/345608 [23:40:13] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [23:49:56] ready to deploy [23:52:35] testing on mwdebug1002 [23:55:05] looks good [23:55:55] Operations, Multimedia, media-storage: 404 error while accessing some images files e.g.
djvu and jpg - https://phabricator.wikimedia.org/T161836#3144779 (EBernhardson) Random debugging: ``` hphpd> $f = wfFindFile(Title::newFromText('File:Fawiki500k_celebration_by_Behdad_Abedi_(180).jpg')) $f = wfFin... [23:58:15] !log aude@tin Synchronized php-1.29.0-wmf.18/extensions/Wikidata: Fixes for special pages (duration: 02m 15s) [23:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:59] done