[00:53:08] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:10:21] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 715 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3040774 keys - replication_delay is 715 [01:11:44] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:17:04] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [01:35:31] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [02:25:57] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 09m 40s) [02:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:47] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 10 02:30:47 UTC 2016 (duration 4m 51s) [02:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [02:41:17] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [03:44:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [03:49:43] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [03:50:29] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:03:08] PROBLEM - MariaDB Slave IO: es2 on es2015 is CRITICAL: CRITICAL slave_io_state could not connect [04:04:17] PROBLEM - mysqld processes on es2015 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [04:04:33] PROBLEM - MariaDB Slave SQL: es2 on es2015 is CRITICAL: CRITICAL slave_sql_state could not connect [04:04:50] PROBLEM - MariaDB Slave IO: es2 on es2016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2015.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2015.codfw.wmnet (111 Connection refused) [04:05:30] PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] [04:05:58] PROBLEM - MariaDB Slave IO: es2 on es2014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2015.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2015.codfw.wmnet (111 Connection refused) [04:06:21] PROBLEM - MariaDB Slave IO: es2 on es1015 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2015.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on es2015.codfw.wmnet (111 Connection refused) [04:09:38] RECOVERY - mysqld processes on es2015 is OK: PROCS OK: 1 process with command name mysqld [04:09:52] well i didnt do shit to that... [04:10:00] i popped online and the clear came up =P [04:11:05] its a master. 
[04:11:06] I did [04:11:19] RECOVERY - MariaDB Slave IO: es2 on es2014 is OK: OK slave_io_state Slave_IO_Running: Yes [04:11:19] RECOVERY - MariaDB Slave IO: es2 on es2015 is OK: OK slave_io_state Slave_IO_Running: Yes [04:11:22] the whole server crashed [04:11:34] RECOVERY - MariaDB Slave IO: es2 on es1015 is OK: OK slave_io_state Slave_IO_Running: Yes [04:12:07] jynus: ok, so just rebooted? [04:12:26] i was about to wonder if i should call someone, since not all of it cleared heh. [04:12:27] it just went down, no signs of a reboot [04:12:32] RECOVERY - MariaDB Slave SQL: es2 on es2015 is OK: OK slave_sql_state Slave_SQL_Running: Yes [04:12:53] RECOVERY - MariaDB Slave IO: es2 on es2016 is OK: OK slave_io_state Slave_IO_Running: Yes [04:13:09] i was asking what you did to fix =] [04:13:34] well, I checked the logs [04:14:01] then I started mysql when I saw it was the machine, not mysql that went down (it does not restart automatically on purpose) [04:14:29] gtid should take care of consistency issues now [04:14:52] Record: 2 [04:14:53] Date/Time: 10/10/2016 03:52:20 [04:14:53] Source: system [04:14:55] Severity: Critical [04:14:57] Description: CPU 1 has an internal error (IERR). [04:14:59] so thats not good [04:15:00] ha [04:15:01] in sel log [04:15:12] I was suspecting something like that [04:15:22] fail over to es2016 as master? [04:15:26] es2 hots gave us problems before [04:15:36] *hosts [04:16:11] RECOVERY - puppet last run on es2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:16:26] still shows all cores to the OS now. 
[04:17:12] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:17:46] robh, supposedly this fixed all issues: https://phabricator.wikimedia.org/T139714 [04:18:05] seems not =[ [04:18:08] and this was one of the original problems: https://phabricator.wikimedia.org/T130702 [04:18:45] I think BIOS or boards have issues [04:19:33] yes, I am going to put it as a slave [04:19:56] so it is easier to depool [04:20:09] someday i hope servermon will have a history of any tasks associated with a host =] (it would be nice to track all repairs needed on any given host easily) [04:21:42] well, I asked that, the answer was the tag is ops-codfw or equivalent [04:21:56] we could update the wiki [04:21:58] too [04:22:19] go to sleep, there is not much to do now [04:22:27] I can handle the rest [04:23:10] yeah, but as someone who tries to research host repairs over time, i'll attest that searching an entire project and then hoping to hit on the host name within the body of a task isnt quick =] [04:23:27] Operations, ops-codfw, DBA: es2015 crashed with no logs - https://phabricator.wikimedia.org/T147769#2702976 (jcrespo) [04:23:44] glad you were about =] [04:24:10] to be fair [04:24:21] I read db1015, and I got nervous [04:24:43] *es1015 [04:25:08] ahh eqiad based page, panic! heh [04:25:55] I think es1015 may have complained [04:26:09] as it is a slave of es2015 [04:31:06] (PS1) Jcrespo: mariadb: Promote es2014 as the new es2 master of codfw [puppet] - https://gerrit.wikimedia.org/r/315042 (https://phabricator.wikimedia.org/T147769) [04:32:51] now the point is how to coordinate that without sending an alert [04:33:25] i imagine there are mysql backend commands in conjunction with that patchset? [04:33:33] that patch just tells mediawiki right? [04:33:59] sorry, thats the site.pp, i assumed it was the db file, heh [04:34:15] yes [04:34:16] rephrase: What is involved in switching masters? 
[04:34:27] (no need to list all the commands now, but just curious) [04:34:35] so, several things on several places [04:34:46] the most important is the following patch [04:39:26] (PS1) Jcrespo: mariadb: Depool es2015 (master, crashed); replaced by es2016 [mediawiki-config] - https://gerrit.wikimedia.org/r/315043 (https://phabricator.wikimedia.org/T147769) [04:40:58] (PS2) Jcrespo: mariadb: Promote es2016 as the new es2 master of codfw [puppet] - https://gerrit.wikimedia.org/r/315042 (https://phabricator.wikimedia.org/T147769) [04:41:03] yeah that one i was looking at to find out it was a master, heh [04:41:13] db-codfw.php [04:41:20] easier place is dbtree.wikimedia.org [04:41:54] now, none of the above patches will actually change the topology [04:42:17] that is why I need to control things at infrastructure side [04:42:29] with a proxy or something [04:42:49] puppet one will just change monitoring and heartbeat execution [04:43:08] mediawiki will just change where to send stuff, but it will not touch the servers themselves [04:43:33] so I still have to do a "CHANGE MASTER TO ..." query on the servers [04:43:51] ok, that was my understanding of it, but its been awhile =] [04:44:02] what has changed for the better [04:44:04] its also why if you werent around you were going to get a text from me ;] [04:44:25] is that I only have to change master to master_host='new host' [04:44:26] though i guess its indeed a codfw master, not full master for all of es2 [04:44:36] so i guess waking you up wouldnt have been needed? 
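Editor's sketch of the per-replica step described above ("I only have to change master to master_host='new host'"). It only builds the SQL text and connects to nothing. It assumes the replica already replicates with GTID, as stated in the channel, so no binlog file or position is given. The function name and the example host es2016.codfw.wmnet are illustrative assumptions, not the commands actually run that night.

```python
import re

# Assumption: hosts look like "es2016.codfw.wmnet". Anything else is rejected
# so a stray quote cannot break the generated SQL.
_HOST_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9.-]*$")


def repoint_replica_statements(new_master_host):
    """Return, in order, the statements to run on one replica.

    Hypothetical helper: it assumes GTID replication is already configured
    on the replica, so only the master host has to change.
    """
    if not _HOST_RE.match(new_master_host):
        raise ValueError(f"unexpected host name: {new_master_host!r}")
    return [
        "STOP SLAVE;",
        f"CHANGE MASTER TO MASTER_HOST='{new_master_host}';",
        "START SLAVE;",
    ]


if __name__ == "__main__":
    # Example host only; the real topology change was done by hand.
    for statement in repoint_replica_statements("es2016.codfw.wmnet"):
        print(statement)
```

As the channel notes, this touches only the replication topology. The puppet patch (monitoring, heartbeat) and the mediawiki-config patch (where MediaWiki sends traffic) are separate steps.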
[04:44:50] it is ok, it is a master of the master of eqiad [04:45:01] so it is sensitive [04:45:05] cool, good to know [04:45:28] if it was 1015 in anything other than a slave lag warning it would have been a much faster response to page by me [04:45:54] yes [04:46:08] I never tested mediawiki's failover there [04:46:16] in theory, it should have done nothing [04:46:52] but as I have not tested it, maybe it would have created failures on 50% of edits or just move the whole thing to read-only [04:48:00] I think I am going to move the slaves first, that will create an icinga alert, but I have downtimed it [04:48:35] jaime doesn't sleep, he just watches db slave lag with one eye closed. ;D [04:48:43] maybe [04:50:44] !log changing topology of es2 @ codfw [04:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:52:24] well, that page caught me as i was wandering in from doing boat repair work. im going afk again to get rid of the various boat gunks i've accumulated from today [04:52:30] s/page/sms [04:53:05] ttyl jaime, thanks for fixing and rebalancing [04:59:51] (PS3) Jcrespo: mariadb: Promote es2016 as the new es2 master of codfw [puppet] - https://gerrit.wikimedia.org/r/315042 (https://phabricator.wikimedia.org/T147769) [05:00:26] (CR) Jcrespo: [C: 2] mariadb: Promote es2016 as the new es2 master of codfw [puppet] - https://gerrit.wikimedia.org/r/315042 (https://phabricator.wikimedia.org/T147769) (owner: Jcrespo) [05:09:16] (CR) Jcrespo: [C: 2] mariadb: Depool es2015 (master, crashed); replaced by es2016 [mediawiki-config] - https://gerrit.wikimedia.org/r/315043 (https://phabricator.wikimedia.org/T147769) (owner: Jcrespo) [05:11:07] !log jynus@tin Synchronized wmf-config/db-codfw.php: mariadb: Depool es2015 (master, crashed); replaced by es2016 (duration: 00m 49s) [05:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:14:38] and I would be able to look at 
the logs if T147748 didn't have 80000 entries on the last 5 minutes [05:14:40] T147748: Large number of CategoryMembershipChangeJob::run updates are failing - https://phabricator.wikimedia.org/T147748 [05:30:53] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:42:52] !log resetting slave on es2 eqiad master (es1015) [05:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:48:09] there were some hanging events from the original master causing trouble on the heartbeat table [05:54:37] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:10:55] (PS1) Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/315045 (https://phabricator.wikimedia.org/T145533) [06:11:51] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/315045 (https://phabricator.wikimedia.org/T145533) (owner: Marostegui) [06:12:21] (Merged) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/315045 (https://phabricator.wikimedia.org/T145533) (owner: Marostegui) [06:13:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 to upgrade its RAID controller firmware - T145533 (duration: 00m 50s) [06:13:57] T145533: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533 [06:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:15:34] !log db1082: Upgrading RAID controller firmware [06:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:16:15] marostegui, can you please stop creating outages? [06:16:28] ? 
[06:16:42] https://logstash.wikimedia.org/goto/f4ad9be0a320a1e627382570eb080999 [06:16:49] I assume this is not the first time [06:17:02] according to the logs [06:18:02] the depooling process is documented here: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Normal_depooling [06:18:41] It had no connections on show processlist [06:19:08] the logs disagree [06:19:33] "Can't connect to MySQL server on '10.64.0.94' (111)" [06:19:36] thousands of those [06:20:37] • The depool is not immediate, during that time, the server keeps receiving queries, which may be undesirable in many cases [06:20:41] I guess that's it [06:21:16] How long should I normally wait, or how long do you normally wait? [06:22:11] I think it may not be your fault [06:22:23] I am checking and it was giving errors way before I touched it actually [06:22:26] at 5:45 even [06:22:29] I am suspecting wikidata wiki is running backups [06:22:34] or something on all servers [06:22:57] because of T147748 [06:23:04] T147748: Large number of CategoryMembershipChangeJob::run updates are failing - https://phabricator.wikimedia.org/T147748 [06:23:15] Right [06:24:13] not 15K at the time you depool: https://logstash.wikimedia.org/goto/79084075172d75f9a60d936a1ec26847 [06:25:15] it goes from 1 in 10 minutes to 8K in 10 minutes [06:25:38] Yeah, I depooled at 6:13 and stopped mysql at 6:16 and it was showing 0 connections [06:27:05] And the errors at 6:00 are around 15k [06:27:12] So something else, as you said, is failing [06:31:08] marostegui, please do not assume change takes effect immediately [06:31:29] jynus: I didn't, I was checking show processlist until it was empty for a few run [06:31:39] it doesn't matter [06:31:59] show processlist isn't perfect (it doesnt detect but current ongoing queries [06:32:11] plus there can be scripts that try to connect that have the old config [06:32:18] yeah, I know. 
I think I will do tcpdumps from now on too :) [06:32:29] so there can be new connections too [06:32:39] even if there are not current ones [06:32:42] Ah, I see [06:32:45] Go it [06:32:48] got it [06:33:27] now, that is a bug (as I said, probably there is something wrong going on with wikidata) [06:33:31] but still [06:34:16] Yep, good advice :) [06:35:14] I guess we should investigate now what's causing that bug? [06:35:24] see my ticket above [06:36:05] Ah, right - fast :) [06:36:05] I already filed it as unbreak now [06:36:13] Operations, HHVM, discovery-system: Restart HHVM on API appservers every about 48 hours - https://phabricator.wikimedia.org/T147773#2703054 (Joe) [06:36:30] not that one [06:37:41] yes, that one [06:37:57] note the 99.8% of those on wikidatawiki [06:38:09] vs. wikidata + dewiki [06:38:18] which means it is logical, not physical [06:38:40] (CR) Giuseppe Lavagetto: [C: -1] "I think the other patch is more to the point you want to reach here." [puppet] - https://gerrit.wikimedia.org/r/302774 (owner: Dzahn) [06:38:41] What do you mean with not physical? [06:38:48] not mysql-related [06:39:02] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[ngrep] [06:39:09] on the reboot you just did, however, it affects all wikis, specially dewiki and wikidata wiki [06:39:17] which means it was the mysql stop itself [06:39:24] (CR) Giuseppe Lavagetto: [C: 1] add mapped v6 IPs for terbium and wasat [puppet] - https://gerrit.wikimedia.org/r/302649 (owner: Dzahn) [06:40:54] Ah, I get what you mean [06:42:03] Then clearly 3 minutes isn't enough and I will be doing tcpdumps too now [06:42:36] <_joe_> marostegui: `ss -tunlp` can help too [06:42:48] <_joe_> sorry -tunap [06:42:52] Thanks :) [06:43:16] <_joe_> you can then get all the live ESTABLISHED connections to mysql [06:43:30] marostegui, to be fair, not even that was probably enough [06:43:34] today [06:43:36] (CR) Giuseppe Lavagetto: [C: 1] videoscalers: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/314699 (owner: Muehlenhoff) [06:43:40] because this ongoing issue [06:44:07] _joe_: Ah, right, useful indeed [06:44:34] and if someone is performing backups from the main servers without communicating, I am going to be very sad [06:44:35] (CR) Giuseppe Lavagetto: [C: 1] "I didn't check that everything is in place (it seems so), but the change is in itself not dangerous." 
[puppet] - https://gerrit.wikimedia.org/r/314469 (owner: Dereckson) [06:44:35] jynus: it would have kept failing anyways [06:45:07] because that has been causing problems all weekend [06:45:09] Operations, ops-codfw, ops-eqiad, media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#2703073 (Marostegui) I have upgraded db1082: ``` root@db1082:~# hpssacli controller slot=1 show | grep -i firmware Firmware Version: 4.02 ``` [06:45:10] (CR) Giuseppe Lavagetto: [C: 1] Add ec.wikimedia.org to Apache configuration [puppet] - https://gerrit.wikimedia.org/r/314470 (https://phabricator.wikimedia.org/T135521) (owner: Dereckson) [06:45:35] and specially, if it doesn't refresh the pooling state often [06:46:11] Operations, DBA, Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2703074 (Marostegui) Upgraded: ``` root@db1082:~# hpssacli controller slot=1 show | grep -i firmware Firmware Version: 4.02 ``` I will slowly get this server back to the pool but I think this... [06:46:28] (CR) Giuseppe Lavagetto: [C: -2] "This was set for historical reasons for analytics; we can just remove this file instead of setting X-Powered-By for static assets." [puppet] - https://gerrit.wikimedia.org/r/314519 (owner: Elukey) [06:47:58] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python-mwclient] [06:49:14] so, from the logs, it took 4 more minutes to stop getting errors [06:49:29] jynus: After mysql got stopped? 
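Editor's sketch of the "count live ESTABLISHED connections" check suggested above with `ss -tunap`. It parses text in the column layout `ss` normally prints, where the state column reads ESTAB and is followed by Recv-Q, Send-Q and the local address. The function name and the sample text are made up for illustration; only the address 10.64.0.94 and port 3306 come from the log. As noted in the channel, a count of zero does not prove the host is drained, because clients with stale config can still open new connections.

```python
# Hypothetical sample in the usual `ss -tan` layout; peer addresses are invented.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    *:3306             *:*
ESTAB  0      0      10.64.0.94:3306    10.64.16.20:51234
ESTAB  0      0      10.64.0.94:22      10.64.0.1:40022
"""


def established_to_port(ss_output, port=3306):
    """Count ESTAB sockets whose local port equals `port`."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        if "ESTAB" not in fields:
            continue  # header, LISTEN, TIME-WAIT, UDP rows, blank lines
        # Works with or without a leading Netid column (as printed by -tunap):
        # the state is followed by Recv-Q, Send-Q, then the local address.
        local_index = fields.index("ESTAB") + 3
        if local_index >= len(fields):
            continue
        local_port = fields[local_index].rsplit(":", 1)[-1]
        if local_port == str(port):
            count += 1
    return count


if __name__ == "__main__":
    print(established_to_port(SAMPLE))
```

Run in a loop after depooling, a check like this complements `show processlist`, but given the 7 minutes observed here it is probably safer to also watch the application error logs before stopping mysql.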
[06:49:33] yep [06:49:46] so that is a total of 7 minutes since it was depooled [06:50:11] well, it depends on $random_long_running process running at the time :-) [06:51:17] and that is the reason why a) rolling restarts of mysql on our env are painful [06:51:26] (CR) Giuseppe Lavagetto: [C: -1] "in general LGTM, but I would not have used conftool here." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) (owner: Filippo Giunchedi) [06:51:42] b) why we should go for a proxy to control topology [06:52:03] jynus: I cannot imagine how painful migrating mysql -> mariadb it must have been :| [06:53:36] well, it was getting boring, that is why I bought hardware that crashes randomly to make it more interesting: T147769 [06:53:37] T147769: es2015 crashed with no logs - https://phabricator.wikimedia.org/T147769 [06:53:58] X-DDD [06:54:28] is that a dell? [06:55:13] I've had issues with CPU Ierr in the past, and Dell decided to change the motherboard as we were unable to find the root cause. 
They blamed chassis temperature first [06:55:19] (PS2) Elukey: Remove the HHVM version for X-Powered-By (static websites) [puppet] - https://gerrit.wikimedia.org/r/314519 [06:55:26] But in the end it was proven that it wasn't [06:58:01] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 531 bytes in 0.012 second response time [07:03:03] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:12:03] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:15:33] (CR) Muehlenhoff: [C: 1] gerrit: mv standard incl to role, rm duplicate firewall [puppet] - https://gerrit.wikimedia.org/r/314768 (owner: Dzahn) [07:19:21] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.069 second response time [07:27:13] Operations, ops-eqiad, DBA: Physically move db1053 to a different rack - https://phabricator.wikimedia.org/T147774#2703076 (Marostegui) [07:27:33] (CR) Alexandros Kosiaris: [C: 1] add mapped v6 IPs for terbium and wasat [puppet] - https://gerrit.wikimedia.org/r/302649 (owner: Dzahn) [07:28:04] (CR) Alexandros Kosiaris: [C: 1] gerrit: mv standard incl to role, rm duplicate firewall [puppet] - https://gerrit.wikimedia.org/r/314768 (owner: Dzahn) [07:29:49] !log installing php security updates on jessie systems [07:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:30:53] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:34:38] (CR) Alexandros Kosiaris: [C: 1] network/constants: add maintenance hosts (v4) [puppet] - https://gerrit.wikimedia.org/r/314778 (owner: Dzahn) [07:34:50] !log Deploying schema change on S4 codfw only commonswiki.revision - T147305 [07:34:52] T147305: Unify commonswiki.revision - https://phabricator.wikimedia.org/T147305 [07:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:35:09] (CR) Alexandros Kosiaris: "I prefer this to https://gerrit.wikimedia.org/r/#/c/302774/" [puppet] - https://gerrit.wikimedia.org/r/314778 (owner: Dzahn) [07:36:53] (CR) Alexandros Kosiaris: [C: -1] "I just realized that you should add entries for labs and labstest as well (labstest you be the same hosts, labs probably some dummy ones)" [puppet] - https://gerrit.wikimedia.org/r/314778 (owner: Dzahn) [07:37:18] (CR) Alexandros Kosiaris: [C: -1] "Yes the one in https://gerrit.wikimedia.org/r/#/c/314778/2 is probably better" [puppet] - https://gerrit.wikimedia.org/r/302774 (owner: Dzahn) [07:42:07] PROBLEM - DPKG on mw2219 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:07] PROBLEM - DPKG on mw2228 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:07] PROBLEM - DPKG on mw2216 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:08] PROBLEM - DPKG on mw2225 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:18] PROBLEM - DPKG on mw2230 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:23] Operations, DBA, Patch-For-Review: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2703104 (Marostegui) Open>Resolved [07:42:30] PROBLEM - DPKG on mw2223 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:30] PROBLEM - DPKG on mw2220 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:37] PROBLEM - DPKG on mw2224 is CRITICAL: DPKG CRITICAL dpkg 
reports broken packages [07:42:37] PROBLEM - DPKG on mw2218 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:51] PROBLEM - DPKG on mw2227 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:51] PROBLEM - DPKG on mw2217 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:42:52] PROBLEM - DPKG on mw2229 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:43:25] looking into the dpkg errors on mw2* [07:43:28] PROBLEM - DPKG on mw2215 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:43:28] PROBLEM - DPKG on mw2221 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:43:30] PROBLEM - puppet last run on mw2224 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [07:43:30] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [07:43:30] PROBLEM - DPKG on mw2222 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:43:37] PROBLEM - DPKG on mw2231 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:43:37] PROBLEM - DPKG on mw2232 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:43:37] PROBLEM - DPKG on mw2226 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:43:53] PROBLEM - puppet last run on mw2086 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lldpd] [07:47:48] RECOVERY - DPKG on mw2220 is OK: All packages OK [07:48:10] RECOVERY - DPKG on mw2217 is OK: All packages OK [07:48:49] PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [07:48:49] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[php-pear] [07:48:49] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [07:48:49] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [07:48:49] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [07:58:46] Operations: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2703108 (elukey) [08:00:15] (PS12) Elukey: Introduce the Imply Pivot UI's module and role [puppet] - https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [08:02:41] Operations, Gerrit: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2703125 (hashar) I am pretty sure that is (**was**?) used for our community metrics tool at http://korma.wmflabs.org/browser/ with some cron fetching something like https://gerrit.wikimedia.or... [08:03:46] (CR) Elukey: [C: 2] Introduce the Imply Pivot UI's module and role [puppet] - https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: Elukey) [08:05:22] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [08:05:24] Operations, ops-eqiad, DBA: db1065: Degraded RAID - https://phabricator.wikimedia.org/T147396#2703129 (Marostegui) All good now, thanks ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline : 0 Physical Devices : 14 Disk... 
[08:05:33] Operations, ops-eqiad, DBA: db1065: Degraded RAID - https://phabricator.wikimedia.org/T147396#2703130 (Marostegui) Open>Resolved [08:06:26] Operations: dubnium disk full - https://phabricator.wikimedia.org/T147173#2703131 (akosiaris) Resolved>Open Reopening, has happened again, the issue is once more due to mx1001's queues. Probably a backscatter spam attack again, investigating [08:06:49] RECOVERY - DPKG on mw2230 is OK: All packages OK [08:07:09] RECOVERY - DPKG on mw2224 is OK: All packages OK [08:07:09] RECOVERY - DPKG on mw2218 is OK: All packages OK [08:07:20] (PS14) Paladox: phabricator: Create & configure a phabricator_stopwords table for innodb [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) [08:08:02] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [08:08:02] RECOVERY - DPKG on mw2222 is OK: All packages OK [08:08:09] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:08:09] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [08:08:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [08:08:31] Operations, Gerrit: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2703133 (hashar) On `lead.wikimedia.org`: ``` $ ls -l /var/www/reviewer-counts.json -rw-r--r-- 1 gerrit2 root 258 Oct 10 01:59 /var/www/reviewer-counts.json ``` [08:08:53] Operations, Gerrit: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2703134 (hashar) [08:09:19] RECOVERY - DPKG on mw2219 is OK: All packages OK [08:10:40] RECOVERY - DPKG on mw2221 is OK: All packages OK [08:10:42] RECOVERY - puppet last run on mw2219 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:10:42] RECOVERY - DPKG on mw2231 is OK: All packages OK [08:10:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [08:11:59] RECOVERY - DPKG on mw2225 is OK: All packages OK [08:12:21] RECOVERY - DPKG on mw2223 is OK: All packages OK [08:12:39] RECOVERY - DPKG on mw2227 is OK: All packages OK [08:12:40] RECOVERY - DPKG on mw2229 is OK: All packages OK [08:12:41] !log clear mx1001's queues from backscatter spam T147173 [08:12:43] T147173: dubnium disk full - https://phabricator.wikimedia.org/T147173 [08:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:13:29] RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:13:30] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:13:47] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [08:13:47] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:14:20] (PS1) Muehlenhoff: Set default openldap log level to "sync" [puppet] - https://gerrit.wikimedia.org/r/315046 [08:14:58] (PS2) Muehlenhoff: Set default openldap log level to "sync" [puppet] - https://gerrit.wikimedia.org/r/315046 (https://phabricator.wikimedia.org/T147173) [08:16:12] RECOVERY - DPKG on mw2232 is OK: All packages OK [08:18:53] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [08:19:00] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:19:42] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:20:13] this is me --^, working on it [08:21:40] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:23:13] (CR) Alexandros Kosiaris: [C: 1] Set default openldap log level to "sync" [puppet] - https://gerrit.wikimedia.org/r/315046 (https://phabricator.wikimedia.org/T147173) (owner: Muehlenhoff) [08:23:46] (CR) Jcrespo: "Looks good now, I missed the '@' on the previous patch: https://puppet-compiler.wmflabs.org/4249/" [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: Paladox) [08:23:56] (CR) Jcrespo: [C: 2] phabricator: Create & configure a phabricator_stopwords table for innodb [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: Paladox) [08:24:31] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Scap_source[analytics/pivot] [08:26:23] RECOVERY - Disk space on dubnium is OK: DISK OK [08:27:40] (03PS2) 10Hashar: contint: add phpdbg for code coverage [puppet] - 10https://gerrit.wikimedia.org/r/314563 (https://phabricator.wikimedia.org/T147778) [08:29:27] (03CR) 10Hashar: "I have created a standalone Jenkins job https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage-php7/ which fails on master " [puppet] - 10https://gerrit.wikimedia.org/r/314563 (https://phabricator.wikimedia.org/T147778) (owner: 10Hashar) [08:29:57] (03PS1) 10Elukey: Fix wrong Pivot dependency in its puppet class [puppet] - 10https://gerrit.wikimedia.org/r/315047 (https://phabricator.wikimedia.org/T138262) [08:32:28] (03CR) 10Elukey: [C: 032] Fix wrong Pivot dependency in its puppet class [puppet] - 10https://gerrit.wikimedia.org/r/315047 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [08:37:47] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: Unmet dependencies around postgis apt packages on maps* servers - https://phabricator.wikimedia.org/T147780#2703199 (10Gehel) [08:38:10] !log Dropping hitcounter, _counter memory tables in S2 - dbstore2002 - T132837 [08:38:12] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837 [08:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:39:45] (03PS1) 10Jcrespo: Change phabricator misc dbs to use puppet TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/315049 (https://phabricator.wikimedia.org/T111654) [08:40:41] ACKNOWLEDGEMENT - puppet last run on maps2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 20 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis] Gehel Investigation in progress on T147780 [08:40:41] ACKNOWLEDGEMENT - puppet last run on maps2002 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 24 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis] Gehel Investigation in progress on T147780 [08:40:41] ACKNOWLEDGEMENT - puppet last run on maps2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis] Gehel Investigation in progress on T147780 [08:40:41] ACKNOWLEDGEMENT - puppet last run on maps2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis] Gehel Investigation in progress on T147780 [08:40:46] !log Populated the sites/ site_identifiers tables on olowiki (T146614) [08:40:47] T146614: Wikibase/Wikidata configuration for olo.wikipedia - https://phabricator.wikimedia.org/T146614 [08:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:42:50] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2703221 (10jcrespo) Above commands as of now: ``` $ sudo salt -C 'G@cluster:mysql and G@site:eqiad' cmd.run 'grep -l 'server\.key' /etc/my.cnf' | grep -c '/etc/my\.cnf' 102 $ sudo salt -C '... [08:53:41] (03CR) 10Jcrespo: [C: 032] Change phabricator misc dbs to use puppet TLS certificates [puppet] - 10https://gerrit.wikimedia.org/r/315049 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [08:53:51] !log reboot graphite2001 and graphite1001 for trusty kernel upgrade [08:53:56] !log Dropping hitcounter, _counter memory tables in S2 - dbstore1001 - T132837 [08:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:53:57] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? 
- https://phabricator.wikimedia.org/T132837 [08:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:54:49] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [08:56:01] !log reboot db1043 to test new mysql configuration and general upgrade- proxy will complain [08:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:57:50] !log Dropping hitcounter, _counter memory tables in S2 - dbstore1002 - T132837 [08:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:01:24] (03PS1) 10Jcrespo: Update phabricator my.cnf config template to include TLS config [puppet] - 10https://gerrit.wikimedia.org/r/315051 (https://phabricator.wikimedia.org/T111654) [09:08:34] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: Unmet dependencies around postgis apt packages on maps* servers - https://phabricator.wikimedia.org/T147780#2703257 (10Gehel) Puppet declares `package { "postgresql-${pgversion}-postgis": }`, which is treated as a regex and expanded to multiple versi... 
[09:12:43] PROBLEM - pivot on stat1001 is CRITICAL: Connection refused [09:15:45] moritzm: graphite kernel upgrade is done [09:16:37] ok, thanks [09:18:28] pivot on stat1001 is mine, I have some issues with scap and first deployment, working on it [09:20:41] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:23:25] (03CR) 10Hashar: [C: 031] Activate subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [09:24:44] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2703298 (10fgiunchedi) [09:25:24] (03CR) 10Jcrespo: [C: 032] Update phabricator my.cnf config template to include TLS config [puppet] - 10https://gerrit.wikimedia.org/r/315051 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [09:26:13] PROBLEM - Disk space on labstore1003 is CRITICAL: DISK CRITICAL - free space: /boot 10 MB (4% inode=99%) [09:27:53] moritzm: there are 7 different kernels on labstore1003 ^^^ [09:28:06] FYI ;) [09:28:25] yeah, I saw that on machines with small / partition (normally very old ones) [09:28:32] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:36] kernels + apt cache can take 1-2GB [09:28:58] (03PS1) 10Alexandros Kosiaris: icinga: Purge /etc/nagios-plugins/config [puppet] - 10https://gerrit.wikimedia.org/r/315055 [09:29:19] this one has a separate boot partition... 236MB ;) [09:29:35] volans: thanks, I'll drop some of the older ones [09:30:02] volans, same issue :-) [09:30:08] is it on LVM?
[09:30:40] !log pruning older, unused kernel images on labstore1003 [09:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:33] (03PS1) 10Elukey: Fix Scap name for the Analytics Pivot repository [puppet] - 10https://gerrit.wikimedia.org/r/315056 (https://phabricator.wikimedia.org/T138262) [09:31:34] RECOVERY - Disk space on labstore1003 is OK: DISK OK [09:32:10] !log rolling reboot of swift frontend servers in eqiad for kernel security update [09:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:40] jynus: yes it has LVM but the /boot one is on sda1 directly :-P [09:33:44] ah [09:33:53] yes, rob told me we do not do that anymore [09:34:01] and this is a good argument [09:34:31] (03CR) 10Elukey: [C: 032] Fix Scap name for the Analytics Pivot repository [puppet] - 10https://gerrit.wikimedia.org/r/315056 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [09:35:08] at some point we should check older servers with bad partitioning [09:35:31] ok, I am restarting db1043 now for real [09:36:06] I think dbstore is going to complain, too [09:39:11] !log Dropping hitcounter, _counter memory tables in S2 - db1063- T132837 [09:39:12] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837 [09:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:42:37] https://phabricator.wikimedia.org/P4187 jynus is that "normal"?
[09:43:55] first of all, careful when pasting logs from the db in public- sometimes they contain private queries [09:44:30] yeah, I checked, nothing relevant there :) [09:45:03] marostegui, I would say- compare to the other slaves [09:45:51] a disconnection here and there should be ok, we retry infinitely [09:45:59] but that seems more frequent than usual [09:46:04] check network [09:47:38] I checked another slave and it doesn't have those frequent disconnections, at least db1063 is not pooled so it is not impacting [09:47:41] I am going to check [09:48:00] the first parts "Semi-sync replication switched OFF." [09:48:06] (because of a timeout) [09:48:23] and "Event Scheduler: [root@localhost][ops.wmf_master_wikiuser_sleep] Unknown thread id" are normal [09:48:39] (because it tries to kill threads that already finished) [09:49:02] Sure, it was more about the network flapping that much [09:49:19] check also the events around those addresses [09:49:46] remember the network saturation earlier last week [09:52:33] That is true [09:52:50] (03PS8) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) [09:53:12] (03CR) 10jenkins-bot: [V: 04-1] phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:55:07] (03PS9) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) [09:55:14] RECOVERY - puppet last run on graphite1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:55:14] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): Scap_source[analytics/pivot/deploy] [09:55:29] (03CR) 10jenkins-bot: [V: 04-1] phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:56:56] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1043.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1043.eqiad.wmnet (111 Connection refused) [09:57:06] (03Abandoned) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:57:14] (03Restored) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:57:20] ^that was expected [09:57:22] (03Abandoned) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/313235 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:57:35] (03Draft1) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/315057 (https://phabricator.wikimedia.org/T146673) [09:57:38] (03Draft2) 10Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - 10https://gerrit.wikimedia.org/r/315057 (https://phabricator.wikimedia.org/T146673) [09:57:53] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [09:57:53] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [09:58:49] ^and that was expected 2 [09:59:07] I didn't want to ack them in advance, because if they fail again, I would not notice it [09:59:25] (right now there 
is no problem, but we are in reduced redundancy) [09:59:48] (03CR) 10Paladox: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [10:02:16] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [10:03:20] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [10:03:20] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [10:03:32] and that is the recovery^ [10:04:16] the network timeouts match when there are spikes of updates [10:04:54] !log running ALTER TABLE search_documentfield ENGINE=InnoDB, FORCE; on phabricator db replica (db1043) [10:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:54] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Purge /etc/nagios-plugins/config [puppet] - 10https://gerrit.wikimedia.org/r/315055 (owner: 10Alexandros Kosiaris) [10:05:57] (03PS2) 10Alexandros Kosiaris: icinga: Purge /etc/nagios-plugins/config [puppet] - 10https://gerrit.wikimedia.org/r/315055 [10:05:59] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: Purge /etc/nagios-plugins/config [puppet] - 10https://gerrit.wikimedia.org/r/315055 (owner: 10Alexandros Kosiaris) [10:06:01] ^I think that should force the table to recreate the indexes with the right config [10:10:57] (03PS4) 10Filippo Giunchedi: Separate Thumbor 404s into their own log [puppet] - 10https://gerrit.wikimedia.org/r/313899 (owner: 10Gilles) [10:12:50] (03CR) 10Filippo Giunchedi: [C: 032] Separate Thumbor 404s into their own log [puppet] - 10https://gerrit.wikimedia.org/r/313899 (owner: 10Gilles) [10:14:36] 06Operations, 10vm-requests: EQIAD|CODFW: (2) VM request for zotero - https://phabricator.wikimedia.org/T147409#2703402 (10akosiaris) VMs will be named sca1003, sca1004 in accordance with our naming policy https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers
[10:14:48] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:19:08] (03PS1) 10Alexandros Kosiaris: Introduce sca1003, sca1004 and sca2003, sca2004 as VMs [dns] - 10https://gerrit.wikimedia.org/r/315060 (https://phabricator.wikimedia.org/T147409) [10:19:57] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:30] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:51] (03PS1) 10Gilles: Make thumbor use /var/run controlled by systemd instead of /tmp [puppet] - 10https://gerrit.wikimedia.org/r/315062 [10:23:12] (03CR) 10jenkins-bot: [V: 04-1] Make thumbor use /var/run controlled by systemd instead of /tmp [puppet] - 10https://gerrit.wikimedia.org/r/315062 (owner: 10Gilles) [10:24:11] (03PS2) 10Gilles: Make thumbor use /var/run controlled by systemd instead of /tmp [puppet] - 10https://gerrit.wikimedia.org/r/315062 [10:25:03] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce sca1003, sca1004 and sca2003, sca2004 as VMs [dns] - 10https://gerrit.wikimedia.org/r/315060 (https://phabricator.wikimedia.org/T147409) (owner: 10Alexandros Kosiaris) [10:29:04] (03PS1) 10Alexandros Kosiaris: Introduce sca1003, sca1004, sca2003, sca2004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/315064 (https://phabricator.wikimedia.org/T147409) [10:31:31] (03PS1) 10Marostegui: db-eqiad: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315065 (https://phabricator.wikimedia.org/T145533) [10:39:08] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:40:17] (03PS1) 10Alexandros Kosiaris: Recurse => true as well on /etc/nagios-plugins/config [puppet] - 10https://gerrit.wikimedia.org/r/315066 [10:40:23] (03CR) 10Marostegui: [C: 032] db-eqiad: Repool db1082 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/315065 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [10:41:02] (03Merged) 10jenkins-bot: db-eqiad: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315065 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [10:41:32] (03PS2) 10Alexandros Kosiaris: Recurse => true as well on /etc/nagios-plugins/config [puppet] - 10https://gerrit.wikimedia.org/r/315066 [10:41:37] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Recurse => true as well on /etc/nagios-plugins/config [puppet] - 10https://gerrit.wikimedia.org/r/315066 (owner: 10Alexandros Kosiaris) [10:42:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with some small weight after its RAID controller firmware - T145533 (duration: 00m 50s) [10:42:36] T145533: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533 [10:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:31] !log Dropping hitcounter, _counter memory tables in S2 - db1069 - T132837 [10:52:32] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837 [10:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:23] (03CR) 10Filippo Giunchedi: prometheus: generate varnish targets from conftool (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) (owner: 10Filippo Giunchedi) [11:22:21] 06Operations, 05Prometheus-metrics-monitoring: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476#2703507 (10jcrespo) I have upgraded db1043 to it, with no known issues. 
[11:22:23] (03PS1) 10Mobrovac: Parsoid: Use Scap3 for config-file deploys [puppet] - 10https://gerrit.wikimedia.org/r/315069 (https://phabricator.wikimedia.org/T144596) [11:25:40] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: Unmet dependencies around postgis apt packages on maps* servers - https://phabricator.wikimedia.org/T147780#2703519 (10Gehel) p:05Triage>03High [11:29:08] (03PS1) 10Marostegui: db-eqiad.php: Increase weight db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315071 (https://phabricator.wikimedia.org/T145533) [11:31:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315071 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [11:31:44] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315071 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [11:33:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1082 after its RAID controller firmware - T145533 (duration: 00m 49s) [11:33:17] T145533: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533 [11:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:21] (03PS1) 10Elukey: Fix the Pivot's systemd unit with the right exec path [puppet] - 10https://gerrit.wikimedia.org/r/315073 (https://phabricator.wikimedia.org/T138262) [11:35:33] RECOVERY - pivot on stat1001 is OK: TCP OK - 0.002 second response time on port 9090 [11:36:52] (03CR) 10Elukey: [C: 032] Fix the Pivot's systemd unit with the right exec path [puppet] - 10https://gerrit.wikimedia.org/r/315073 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [11:41:05] PROBLEM - DPKG on mw1267 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:41:15] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [11:41:28] PROBLEM - DPKG on mw1263 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:41:28] PROBLEM - DPKG on mw1271 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:41:45] PROBLEM - DPKG on mw1261 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:41:46] PROBLEM - DPKG on mw1268 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:42:44] PROBLEM - DPKG on mw1266 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:42:45] PROBLEM - DPKG on mw1270 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:44:07] RECOVERY - DPKG on mw1263 is OK: All packages OK [11:44:07] RECOVERY - DPKG on mw1271 is OK: All packages OK [11:46:33] RECOVERY - DPKG on mw1267 is OK: All packages OK [11:47:13] RECOVERY - DPKG on mw1268 is OK: All packages OK [11:48:04] RECOVERY - DPKG on mw1266 is OK: All packages OK [11:48:04] RECOVERY - DPKG on mw1270 is OK: All packages OK [11:48:17] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[php-pear] [11:48:30] Anyone deploying? 
I would like to push out a fix for an UBN bug in a bit [11:49:44] RECOVERY - DPKG on mw1261 is OK: All packages OK [11:52:20] !log swift eqiad-prod: ms-be1022 to weight 2000 T136631 [11:52:21] T136631: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631 [11:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:32] going ahead then [12:02:32] (03PS1) 10Alexandros Kosiaris: icinga: Delete check_nrpe.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315076 [12:04:16] (03CR) 10Mobrovac: [C: 031] "Cherry-picked in beta, generates the correct config-vars.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/315069 (https://phabricator.wikimedia.org/T144596) (owner: 10Mobrovac) [12:05:59] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Delete check_nrpe.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315076 (owner: 10Alexandros Kosiaris) [12:06:02] (03PS2) 10Alexandros Kosiaris: icinga: Delete check_nrpe.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315076 [12:06:04] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: Delete check_nrpe.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315076 (owner: 10Alexandros Kosiaris) [12:06:49] !log hoo@tin Synchronized php-1.28.0-wmf.21/extensions/Wikidata: Update Wikibase, add EntityHandler::supportsCategories (T147748) (duration: 02m 25s) [12:06:49] T147748: Large number of CategoryMembershipChangeJob::run updates are failing - https://phabricator.wikimedia.org/T147748 [12:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:08:14] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:09:58] (03PS1) 10Gehel: Postgresql / postgis: use full package name [puppet] - 10https://gerrit.wikimedia.org/r/315077 (https://phabricator.wikimedia.org/T147780) [12:10:36] (03PS2) 10Gehel: Postgresql / postgis: use full package name [puppet] - 10https://gerrit.wikimedia.org/r/315077 
(https://phabricator.wikimedia.org/T147780) [12:12:38] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [12:16:56] (03PS1) 10Hashar: contint: update unattended-upgrade setting [puppet] - 10https://gerrit.wikimedia.org/r/315079 [12:25:42] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: Unmet dependencies around postgis apt packages on maps* servers - https://phabricator.wikimedia.org/T147780#2703574 (10Gehel) labsdb1004 is also concerned with this issue [12:33:31] next [12:33:48] Jouncebot is down. [12:37:04] (03PS3) 10Gehel: Postgresql / postgis: use full package name [puppet] - 10https://gerrit.wikimedia.org/r/315077 (https://phabricator.wikimedia.org/T147780) [12:37:55] 06Operations, 06Discovery, 06WMDE-Analytics-Engineering, 10Wikidata, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2703582 (10Addshore) p:05Triage>03Normal [12:37:58] 06Operations, 06Performance-Team, 10Thumbor: Separate 404s into their own log - https://phabricator.wikimedia.org/T145632#2703583 (10Gilles) [12:38:09] 06Operations, 06Performance-Team, 10Thumbor: Separate 404s into their own log - https://phabricator.wikimedia.org/T145632#2703584 (10Gilles) a:05fgiunchedi>03Gilles [12:40:23] (03PS1) 10Marostegui: db-eqiad.php: Restore db1082 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315080 (https://phabricator.wikimedia.org/T145533) [12:40:40] (03CR) 10Gehel: [C: 032] Postgresql / postgis: use full package name [puppet] - 10https://gerrit.wikimedia.org/r/315077 (https://phabricator.wikimedia.org/T147780) (owner: 10Gehel) [12:42:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1082 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315080 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [12:42:53] (03PS2) 10Hashar: contint: update 
unattended-upgrade setting [puppet] - 10https://gerrit.wikimedia.org/r/315079 [12:43:14] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1082 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315080 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [12:43:59] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:44:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1082 after its RAID controller firmware - T145533 (duration: 00m 55s) [12:44:49] T145533: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533 [12:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:26] !log uploaded nodejs 4.6.0 for jessie-wikimedia to carbon [12:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:26] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:51:34] o/ [12:55:04] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:55:30] o/ [12:55:35] (03PS1) 10Elukey: Add the Druid's default port to Pivot's hieradata config [puppet] - 10https://gerrit.wikimedia.org/r/315081 (https://phabricator.wikimedia.org/T138262) [12:55:39] hashar: what's the plan with eu swat? [12:55:45] should I do it? [12:55:56] there seems to be just one config change [12:56:04] have some errand to conduct here [12:56:08] namely restroom :D [12:56:10] unusual for a Monday [12:56:21] I guess it has been a quiet week-end [12:56:32] there is one for dcausse at least [12:56:36] that looks straightforward [12:56:40] will come back in roughly 10 mins [12:57:09] hashar: should I start with the deploy on time? 
[12:57:51] (03CR) 10Elukey: [C: 032] Add the Druid's default port to Pivot's hieradata config [puppet] - 10https://gerrit.wikimedia.org/r/315081 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [12:58:40] (03PS1) 10Alexandros Kosiaris: icinga: Remove /etc/icinga/conf.d [puppet] - 10https://gerrit.wikimedia.org/r/315082 [12:59:05] RECOVERY - puppet last run on maps2004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:00:25] RECOVERY - puppet last run on maps2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:01:11] looks like jouncebot is down [13:01:17] o/ [13:01:19] oh, here it is [13:01:28] o/ [13:02:00] o/ [13:02:03] zeljkof: go go go :) [13:02:09] ok, in that case... [13:02:13] :) [13:02:13] I can SWAT today! [13:02:19] \o/ [13:02:25] dcausse: I see you are ready [13:02:29] yes I am :) [13:02:56] can you test the commit at mw1099, or should I just deploy everywhere? [13:03:08] zeljkof: yes I can test [13:03:24] dcausse: ok, in that case, I will ping you in a few minutes to test [13:03:30] sure [13:05:30] (03PS3) 10Zfilipin: Activate subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [13:05:44] rebasing and +2ing 314257...
[13:06:23] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [13:06:52] (03Merged) 10jenkins-bot: Activate subphrases autocomplete on wikisources, mw.org and wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [13:08:11] dcausse: the commit is at mw1099, please test [13:08:16] ok testing [13:09:52] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2703608 (10Dereckson) Note an argument against projectgrants and other variants have been given: >>! In T143138#2559236, @Mjohnson_WMF wrote: > Program Officers... [13:10:17] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2703612 (10Dereckson) [13:11:18] zeljkof: hmm does not work on wikitech, but maybe this wiki is special? [13:11:34] it works on mw.org and wikisources [13:11:47] hashar: do you know the answer? ^ [13:11:57] hmm [13:12:11] maybe wikitech cannot be served with mw1099? 
[13:12:12] dcausse: I know there are things different about wikitech, but not sure for this case [13:12:17] wikitech runs on a single host silver.xxx [13:12:17] ok [13:12:20] ah [13:12:21] wikitech is not part of the cluster [13:12:23] so I don't think we can test it on mw1099 [13:12:26] !log reimage maps-test2002 - T147194 [13:12:27] T147194: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194 [13:12:32] if it works on the others [13:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:36] let's sync to the whole cluster [13:12:40] zeljkof: so all good, I'll test on wikitech later [13:12:45] hashar, dcausse: ok, in that case deploying [13:12:45] then once deployed, check on wikitech that everything works fine [13:12:46] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2703617 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2002.codfw.wmnet'] ``` The log can be found in `/va...
[13:13:05] we can afford to have wikitech slightly broken [13:13:20] it is only a tiny share of traffic and only impacts a few hundred users at most [13:13:36] yes, makes sense [13:13:40] (yeah I should not say that, but the idea is that it is less impact than breaking search on say enwiki :D ) [13:13:58] and even there, that is "just" search :D [13:14:07] he :) [13:14:34] which is unlikely to make the news compared to blank pages being served worldwide for half an hour :D [13:15:04] indeed, I can't argue against that :) [13:15:06] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:314257|Activate subphrases autocomplete on wikisources, mw.org and wikitech (T146208)]] (duration: 00m 50s) [13:15:07] T146208: Enable sub-phrase completion suggester on wikitech, mediawiki.org and wikisource - https://phabricator.wikimedia.org/T146208 [13:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:16] zeljkof: thanks! [13:15:20] dcausse: ok, deployed, please test on wikitech [13:15:34] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:15:42] zeljkof: dcausse: wikitech is only served from Silver [13:15:51] what is that subphrase autocompletion feature about ? [13:15:54] I am being curious [13:16:04] (03PS1) 10Elukey: Move the httpd Proxy rules from stat1002 to localhost for Pivot [puppet] - 10https://gerrit.wikimedia.org/r/315083 (https://phabricator.wikimedia.org/T138262) [13:16:04] zeljkof: it works [13:16:36] dcausse: great! [13:16:44] Dereckson: thanks! [13:16:46] zeljkof: so when you need to test something specific to Wikitech, but aren't willing to send it to the full cluster, you can login to silver and do a scap pull there [13:16:58] Dereckson: hm, interesting [13:17:00] (that's not a canary, it's the prod host for wikitech) [13:17:01] hashar: in the top right box it can suggest subphrases/subpages, e.g. 
for the page Team Practice Group/Glossary if you search for glossary you can find it [13:17:01] good tip [13:17:11] dcausse: neat :] [13:17:32] hashar: ok, since there are no other patches, closing the swat? [13:17:39] (this was a quick one) [13:17:40] but you need to activate it in your prefs, we haven't made it the default not to break bots [13:17:51] (03CR) 10Elukey: [C: 032] Move the httpd Proxy rules from stat1002 to localhost for Pivot [puppet] - 10https://gerrit.wikimedia.org/r/315083 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [13:18:44] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:20:04] !log ending EU SWAT! [13:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:17] 06Operations, 10Gerrit: cronspam from cobalt after the Gerrit migration - https://phabricator.wikimedia.org/T147776#2703638 (10Dzahn) root@cobalt:/var/www# touch reviewer-counts.json root@cobalt:/var/www# chown gerrit2 reviewer-counts.json puppetizing will follow tomorrow [13:21:13] (03PS1) 10Hashar: contint: unattended upgrade from distro [puppet] - 10https://gerrit.wikimedia.org/r/315084 [13:23:45] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:26:47] (03PS2) 10Hashar: contint: unattended upgrade from distro [puppet] - 10https://gerrit.wikimedia.org/r/315084 [13:29:19] (03PS1) 10Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - 10https://gerrit.wikimedia.org/r/315085 [13:29:21] (03PS1) 10Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 [13:29:23] (03PS1) 10Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/315087 [13:29:47] (03CR) 10BBlack: [C: 031] cache_upload: remove varnish3 VCL compat [puppet] - 10https://gerrit.wikimedia.org/r/314658 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [13:30:14] (03CR) 10BBlack: [C: 031] cache_text backend VCL: use bereq in misspass_mangle [puppet] - 10https://gerrit.wikimedia.org/r/314715 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:30:44] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:31:50] (03PS2) 10BBlack: cache_upload: remove varnish3 VCL compat [puppet] - 10https://gerrit.wikimedia.org/r/314658 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [13:31:57] 06Operations, 06Performance-Team, 10Thumbor: Separate 404s into their own log - https://phabricator.wikimedia.org/T145632#2703652 (10Gilles) [13:32:03] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: remove varnish3 VCL compat [puppet] - 10https://gerrit.wikimedia.org/r/314658 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [13:33:34] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:34:07] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[13:35:19] (03CR) 10Ottomata: "This should be a no-op, so I think its fine, but, is this the way to do this? It doesn't seem right to put base::firewall in a role. bas" [puppet] - 10https://gerrit.wikimedia.org/r/314338 (owner: 10Andrew Bogott) [13:36:17] (03CR) 10Ottomata: [C: 031] Decommission the old AQS cluster [puppet] - 10https://gerrit.wikimedia.org/r/314542 (https://phabricator.wikimedia.org/T147461) (owner: 10Elukey) [13:36:49] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 3: OK: 2I:4:2, 2I:4:1, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:38:45] Dropping hitcounter, _counter memory tables in S4 - db1040 (master) - T132837 [13:38:45] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837 [13:38:54] !log Dropping hitcounter, _counter memory tables in S4 - db1040 (master) - T132837 [13:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:39:05] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [13:39:29] (03CR) 10Ottomata: [C: 031] vk::webrequest - adjust peak rate estimates [puppet] - 10https://gerrit.wikimedia.org/r/314336 (owner: 10BBlack) [13:40:06] (03PS1) 10Gilles: Upgrade to 0.1.25 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/315088 [13:41:49] (03CR) 10Alexandros Kosiaris: [C: 031] "Yes, that's not exactly perfectly suited to the role abstraction level. But we did not really have a good other abstraction level when we " [puppet] - 10https://gerrit.wikimedia.org/r/314338 (owner: 10Andrew Bogott) [13:45:35] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[13:47:11] (03PS2) 10Alexandros Kosiaris: icinga: Remove event_profiling_enabled [puppet] - 10https://gerrit.wikimedia.org/r/315085 [13:47:13] (03PS2) 10Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/315087 [13:47:15] (03PS2) 10Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 [13:47:17] (03PS1) 10Alexandros Kosiaris: Specify mode for nagios_hostgroup and nagios::servicegroup [puppet] - 10https://gerrit.wikimedia.org/r/315089 [13:47:19] (03PS1) 10Alexandros Kosiaris: icinga: Kill hostextinfo [puppet] - 10https://gerrit.wikimedia.org/r/315090 [13:48:36] (03PS1) 10BBlack: VCL: remove remaining v3 compat from misc+upload [puppet] - 10https://gerrit.wikimedia.org/r/315091 [13:48:56] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:49:28] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Remove /etc/icinga/conf.d [puppet] - 10https://gerrit.wikimedia.org/r/315082 (owner: 10Alexandros Kosiaris) [13:49:32] (03PS2) 10Alexandros Kosiaris: icinga: Remove /etc/icinga/conf.d [puppet] - 10https://gerrit.wikimedia.org/r/315082 [13:49:35] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: Remove /etc/icinga/conf.d [puppet] - 10https://gerrit.wikimedia.org/r/315082 (owner: 10Alexandros Kosiaris) [13:49:45] (03PS2) 10Alexandros Kosiaris: Specify mode for nagios_hostgroup and nagios::servicegroup [puppet] - 10https://gerrit.wikimedia.org/r/315089 [13:50:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Specify mode for nagios_hostgroup and nagios::servicegroup [puppet] - 10https://gerrit.wikimedia.org/r/315089 (owner: 10Alexandros Kosiaris) [13:50:15] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:50:56] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [13:51:26] (03CR) 10jenkins-bot: [V: 04-1] icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 (owner: 10Alexandros Kosiaris) [13:52:36] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:41] (03PS2) 10BBlack: VCL: remove remaining v3 compat from misc+upload [puppet] - 10https://gerrit.wikimedia.org/r/315091 [13:54:10] (03CR) 10jenkins-bot: [V: 04-1] icinga: Kill hostextinfo [puppet] - 10https://gerrit.wikimedia.org/r/315090 (owner: 10Alexandros Kosiaris) [13:55:07] (03PS3) 10BBlack: VCL: remove remaining v3 compat from misc+upload [puppet] - 10https://gerrit.wikimedia.org/r/315091 [13:55:17] (03CR) 10BBlack: [C: 032 V: 032] "compiler no-op on misc+upload nodes" [puppet] - 10https://gerrit.wikimedia.org/r/315091 (owner: 10BBlack) [13:56:03] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2703675 (10BBlack) [13:56:06] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2703674 (10BBlack) 05Open>03Resolved [13:56:33] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:57:10] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2703677 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2002.codfw.wmnet'] ``` Those hosts were successful: ``` [] ``` [13:57:32] (03PS2) 10BBlack: cache_text backend VCL: use bereq in misspass_mangle [puppet] - 10https://gerrit.wikimedia.org/r/314715 
(https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:57:38] (03CR) 10BBlack: [C: 032 V: 032] cache_text backend VCL: use bereq in misspass_mangle [puppet] - 10https://gerrit.wikimedia.org/r/314715 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:57:55] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2701526 (10Ottomata) - Does this mean that a single role can no longer be used by both labs and production? - Modules 'must not use classes from other m... [13:58:20] (03PS2) 10BBlack: vk::webrequest - adjust peak rate estimates [puppet] - 10https://gerrit.wikimedia.org/r/314336 [13:58:36] (03CR) 10BBlack: [C: 032 V: 032] vk::webrequest - adjust peak rate estimates [puppet] - 10https://gerrit.wikimedia.org/r/314336 (owner: 10BBlack) [14:01:04] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:20] 06Operations, 05Prometheus-metrics-monitoring: Upgrade mysqld_exporter to 0.9.0 - https://phabricator.wikimedia.org/T147476#2703684 (10fgiunchedi) As a side note, the latest version of prometheus-mysqld-exporter correct sets build information, so the rollout can be also tracked with a prometheus query e.g. `c... [14:01:29] damn puppet... 
neon puppet is me [14:02:13] (03Abandoned) 10BBlack: VCL backends 1/N [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/300574 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [14:02:20] (03Abandoned) 10BBlack: VCL backends 2/N: sort misc req_handling [puppet] - 10https://gerrit.wikimedia.org/r/300579 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [14:02:25] (03Abandoned) 10BBlack: VCL backends 3/N: add force-pass support [puppet] - 10https://gerrit.wikimedia.org/r/300581 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [14:02:29] (03Abandoned) 10BBlack: VCL backends 4/N: subpaths and defaulting [puppet] - 10https://gerrit.wikimedia.org/r/300655 (owner: 10BBlack) [14:02:32] (03Abandoned) 10BBlack: VCL backends 5/N: use for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/300656 (owner: 10BBlack) [14:03:31] !log reimage maps-test200[34] - T147194 [14:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:17] T147194: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194 [14:04:25] (03PS1) 10Alexandros Kosiaris: icinga: Vary on puppet agent version [puppet] - 10https://gerrit.wikimedia.org/r/315093 [14:04:26] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2703689 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2003.codfw.wmnet'] ``` The log can be found in `/va... [14:04:35] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2703690 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2004.codfw.wmnet'] ``` The log can be found in `/va... 
[14:08:03] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Vary on puppet agent version [puppet] - 10https://gerrit.wikimedia.org/r/315093 (owner: 10Alexandros Kosiaris) [14:08:38] (03CR) 10Filippo Giunchedi: [C: 04-1] Upgrade to 0.1.25 (031 comment) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/315088 (owner: 10Gilles) [14:10:04] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2703694 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2004.codfw.wmnet'] ``` Those hosts were successful: ``` [] ``` [14:13:32] !log upgraded PHP on bohrium/piwik.wikimedia.or [14:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:56] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:52] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [14:15:52] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:18:00] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2703699 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2004.codfw.wmnet'] ``` The log can be found in `/va... 
[14:18:03] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:19:06] (03PS4) 10Filippo Giunchedi: prometheus: generate varnish targets from conftool [puppet] - 10https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) [14:22:01] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:22:26] (03PS3) 10Muehlenhoff: Set default openldap log level to "sync" [puppet] - 10https://gerrit.wikimedia.org/r/315046 (https://phabricator.wikimedia.org/T147173) [14:22:28] 06Operations, 10DBA: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2703707 (10Marostegui) 05Open>03Resolved [14:24:04] (03CR) 10Muehlenhoff: [C: 032] Set default openldap log level to "sync" [puppet] - 10https://gerrit.wikimedia.org/r/315046 (https://phabricator.wikimedia.org/T147173) (owner: 10Muehlenhoff) [14:38:27] (03Abandoned) 10MarcoAurelio: Labs configuration for olo.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/312812 (https://phabricator.wikimedia.org/T146612) (owner: 10MarcoAurelio) [14:39:43] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[14:40:31] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2703747 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2003.codfw.wmnet'] ``` Those hosts were successful: ``` ['maps-test2003.codfw.wmnet'] ``` [14:40:33] (03Abandoned) 10MarcoAurelio: Enable $ExtraSignatureNamespaces for all namespaces for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310802 (https://phabricator.wikimedia.org/T145619) (owner: 10MarcoAurelio) [14:41:25] (03PS1) 10Filippo Giunchedi: [WIP] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) [14:42:22] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [14:42:53] (03CR) 10jenkins-bot: [V: 04-1] [WIP] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) (owner: 10Filippo Giunchedi) [14:43:59] hi, still on swat window? [14:44:08] Hello [14:44:20] There is a window in 80 minutes. 
[14:44:30] jouncebot: next [14:44:31] In 2 hour(s) and 15 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161010T1700) [14:44:41] jouncebot: now [14:44:41] No deployments scheduled for the next 2 hour(s) and 15 minute(s) [14:44:57] (03PS2) 10Filippo Giunchedi: [WIP] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) [14:45:07] mafk: timezone issue, in 200 minutes [14:45:22] it's 18:00 UTC, not CEST the morning window [14:46:12] ah, k, I see it at the calendar, I'll add some patches [14:46:15] (03CR) 10jenkins-bot: [V: 04-1] [WIP] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) (owner: 10Filippo Giunchedi) [14:49:21] (03PS3) 10Filippo Giunchedi: [WIP] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) [14:49:23] (03PS5) 10Filippo Giunchedi: prometheus: generate varnish targets from get_clusters() [puppet] - 10https://gerrit.wikimedia.org/r/310819 (https://phabricator.wikimedia.org/T147424) [14:50:48] (03CR) 10jenkins-bot: [V: 04-1] [WIP] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) (owner: 10Filippo Giunchedi) [14:53:30] Error: Could not run: cannot load such file -- puppet/util/puppetdb le sigh [14:53:58] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2703780 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2004.codfw.wmnet'] ``` Those hosts were successful: ``` ['maps-test2004.codfw.wmnet'] ``` [14:55:42] seen it before? puppet compiler failing like that on puppetdb _joe_ or akosiaris perhaps ? 
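The "80 minutes" versus "200 minutes" mix-up above comes from reading 18:00 as CEST rather than UTC. A minimal sketch of that arithmetic, assuming the question was asked at about 14:45 UTC on 2016-10-10 and treating CEST as a fixed UTC+2 offset; the helper name `minutes_until` is purely illustrative:

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: CEST modelled as a fixed UTC+2 offset, with no DST rules.
CEST = timezone(timedelta(hours=2), "CEST")


def minutes_until(now, target):
    """Whole minutes from `now` to `target`; both must be timezone-aware."""
    return int((target - now).total_seconds() // 60)


now = datetime(2016, 10, 10, 14, 45, tzinfo=UTC)
window_utc = datetime(2016, 10, 10, 18, 0, tzinfo=UTC)       # what the calendar means
window_misread = datetime(2016, 10, 10, 18, 0, tzinfo=CEST)  # 18:00 read as local time

print(minutes_until(now, window_utc))      # wait when 18:00 is UTC
print(minutes_until(now, window_misread))  # wait when 18:00 is taken as CEST
print(window_utc.astimezone(CEST).strftime("%H:%M %Z"))
```

Keeping every timestamp timezone-aware makes the two readings differ by exactly the UTC offset, which is the gap the two estimates in the channel show.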
[14:57:45] (03PS7) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [14:59:32] (03PS1) 10Elukey: Remove old wikistats cron script causing cron-spam [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) [14:59:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:04:11] that 5xx spike is coming from iridium (phab) [15:04:18] it was very brief though [15:08:15] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:09:18] (03PS8) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [15:10:14] (03PS9) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [15:10:54] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:15:14] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:16:57] godog: I've seen a bunch of those socket timeouts from those swift hosts ^^^, should we increase the timeout? 
(considering that command_timeout=60 on nrpe AFAIK) [15:17:39] doing a couple of time of the check I got ~33-36s, I guess that when a bit overloaded it might easily go over 40 [15:17:46] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:18:29] volans: heh incidentally I was looking at that too, yeah I don't have a good solution besides increasing the timeout further [15:24:36] godog: besides increasing it to 50 or 55 maybe for the raid check also the max_check_attempts could be improved only for those checks [15:24:43] (03PS1) 10Filippo Giunchedi: raid: increase check_hpssacli timeout [puppet] - 10https://gerrit.wikimedia.org/r/315103 [15:25:01] but I didn't look at the puppet side to check how feasible it is [15:25:14] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:25:54] volans: yup, that and checking less often, it might be one minute now IIRC [15:26:07] yes, too [15:27:53] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:27:56] (03CR) 10Hashar: "Added a systemd wrapper." 
[puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) (owner: 10Hashar) [15:28:42] godog: retries is a parameter of nrpe::monitor_service, the normal_check_interval and retry_check_interval not AFAICS [15:29:41] (03PS2) 10Elukey: Remove old wikistats cron script causing cron-spam [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) [15:29:58] volans: sigh, indeed, not sure what's the puppet way (tm) to do that [15:30:34] * elukey tries to remove perl packages from stat1001 [15:31:17] godog: should be easy, looks like they are parameters in monitoring::service [15:31:26] that is called by the define nrpe::monitor_service [15:32:28] (03PS1) 10DCausse: Elastic@deployment-prep: force the number of replicas to 1 max [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315104 (https://phabricator.wikimedia.org/T147777) [15:34:06] volans: nothing is easy in puppet :) [15:34:19] true that [15:34:28] shooting yourself in the foot is not that hard to be fair [15:35:35] hehehe [15:35:53] (03PS1) 10Elukey: Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 [15:36:02] volans: ah the .ctags files in puppet.git is what gives ctags-exuberant magic powers, I just noticed [15:36:18] at least that [15:36:21] oh great, thanks for the heads up [15:36:27] (03CR) 10jenkins-bot: [V: 04-1] Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 (owner: 10Elukey) [15:39:21] (03PS2) 10Elukey: Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 [15:40:35] (03PS3) 10Elukey: Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 [15:41:09] straw man coming up [15:41:20] (03PS1) 10Filippo Giunchedi: raid: tweak check_interval for forking checks [puppet] - 10https://gerrit.wikimedia.org/r/315107 
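The timeout discussion above can be put into numbers. This is only a rough sketch: the measured durations (33 to 36 seconds), the current 40 second timeout, the proposed 50 or 55, and the NRPE `command_timeout=60` are taken from the conversation, while the function and its field names are made up for illustration:

```python
def timeout_headroom(measured_seconds, client_timeout, server_timeout=60):
    """Compare a candidate check timeout with observed run times.

    measured_seconds: observed durations of the check, in seconds
    client_timeout:   timeout applied by the caller of the check
    server_timeout:   NRPE-side command timeout (60 per the discussion)
    """
    slowest = max(measured_seconds)
    return {
        "slowest_run": slowest,
        # Seconds left before the slowest observed run would time out.
        "headroom": client_timeout - slowest,
        # The caller should give up before the server kills the command.
        "below_server_timeout": client_timeout < server_timeout,
    }


for candidate in (40, 50, 55):
    print(candidate, timeout_headroom([33, 36], candidate))
```

With only about four seconds of headroom at a 40 second timeout, a slightly loaded host is enough to trip the check, which matches the flapping HP RAID alerts in the log.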
[15:43:57] (03PS4) 10Elukey: Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 [15:45:04] (03CR) 10jenkins-bot: [V: 04-1] Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 (owner: 10Elukey) [15:45:59] yes sorry Jenkins [15:50:37] !log mathoid deploying adb8e548 [15:50:38] godog: 1h is not a bit too much? [15:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:00] (03PS3) 10Filippo Giunchedi: standard: add prometheus node_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/310519 (https://phabricator.wikimedia.org/T140646) [15:52:04] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:53:05] volans: IIRC we're running with check_interval 60s, that'd be 10m [15:53:23] (03PS5) 10Elukey: Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 [15:53:25] yes, each, but retry 6 means that it will alarm after 1h [15:53:33] IRRC [15:53:34] IIRC [15:53:49] <_joe_> IRCC [15:55:07] IIRC (!) 
retry_check_interval is how often to check when a non-ok status is detected [15:55:09] * _joe_ goes back to docker building [15:55:27] yep [15:55:47] ah sorry, I misread it [15:56:02] I was thinking about attemprs [15:56:04] *attempts [15:56:42] that could be set too, is retries in the code [15:57:48] yeah it is confusing, but to your point IMO we could also be checking once an hour or less often [15:57:54] (03CR) 10Volans: "See inline, style suggestion" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315107 (owner: 10Filippo Giunchedi) [15:58:53] tbh, 1h is perfectly fine for me for any redundant RAID configuration [15:59:40] for the few cases in which we have non-redundant ones it will be nice to know the disk broke sooner given that probably we already got another alarm [15:59:57] but I guess we can figure it out anyway quickly from the logs of the service that is using those disks [16:01:06] indeed, some other alarm would have likely tripped anyway [16:04:50] (03CR) 10Ottomata: [C: 031] Revert "Remove kafka1012 from EventLogging brokers array" [puppet] - 10https://gerrit.wikimedia.org/r/315106 (owner: 10Elukey) [16:05:22] ACKNOWLEDGEMENT - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdn] Filippo Giunchedi disk to be added back, see T139767 [16:12:00] (03CR) 10Jdlrobson: "Thanks for making this happen :)" (039 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147234) (owner: 10Dereckson) [16:16:55] _joe_: can I assign https://phabricator.wikimedia.org/T147800 to you? 
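The confusion above between the retry interval and the number of attempts is easier to see with numbers. A hedged sketch of the usual Nagios/Icinga 1 timing model as I understand it: the first failing check counts as attempt 1, rechecks then run every `retry_interval` until `max_check_attempts` is reached, and only then does the alert go out. The values below are hypothetical and are not the ones in the patch under review:

```python
def minutes_to_hard_state(retry_interval, max_check_attempts):
    """Minutes from the first failing check until the alert fires.

    The first failure is attempt 1, so only (max_check_attempts - 1)
    rechecks, spaced retry_interval minutes apart, are still needed.
    """
    return (max_check_attempts - 1) * retry_interval


def worst_case_detection(check_interval, retry_interval, max_check_attempts):
    """Upper bound in minutes from the actual fault to the alert.

    Worst case: the fault happens right after a passing check, so a full
    check_interval elapses before the first failing result is seen.
    """
    return check_interval + minutes_to_hard_state(retry_interval, max_check_attempts)


# Hypothetical settings: check hourly, recheck every minute, alert on 3rd failure.
print(minutes_to_hard_state(retry_interval=1, max_check_attempts=3))
print(worst_case_detection(check_interval=60, retry_interval=1, max_check_attempts=3))
```

Under this model a long `check_interval` mostly delays the first detection, while the retry interval and attempt count control how long a detected failure takes to turn into an alert.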
[16:17:08] (03PS2) 10Andrew Bogott: Add base::firewall to role::eventbus::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/314338 [16:17:40] 06Operations, 06Discovery, 06Discovery-Analysis: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2703962 (10elukey) @Ottomata any objection? [16:17:48] (03PS2) 10Filippo Giunchedi: raid: tweak check_interval for forking checks [puppet] - 10https://gerrit.wikimedia.org/r/315107 [16:17:50] (03CR) 10Filippo Giunchedi: raid: tweak check_interval for forking checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315107 (owner: 10Filippo Giunchedi) [16:19:16] (03CR) 10Andrew Bogott: [C: 032] Add base::firewall to role::eventbus::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/314338 (owner: 10Andrew Bogott) [16:19:50] (03PS4) 10Filippo Giunchedi: [WIP] prometheus: generate Varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/315098 (https://phabricator.wikimedia.org/T147424) [16:20:54] How come it says "Failed to fetch notifications." when I try to look at my notifications for Wikivoyage? [16:21:06] It's strange.. [16:21:36] anything in console? [16:22:18] But if I click on "Mark as read" for the selected wiki, they show. [16:22:25] Weird.. [16:22:52] * Bsadowski1 slaps bugs [16:22:58] Bsadowski1? [16:23:04] maybe a bad cookie that fixed when clicking on mark as read? [16:23:10] Not sure. [16:23:30] you don't know if there is anything in console? [16:23:35] Im not 100% sure on how the new notification system works :/ [16:24:04] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:29:22] Well, I do see many unrelated events in there.. [16:29:25] "Use of "$j" is deprecated. Use $ or jQuery instead." [16:29:45] "Use of "EmbedPlayer.EnableIpadNativeFullscreen" is deprecated. Use mw.config instead." 
[16:30:11] Hmm [16:34:07] (03PS2) 10Muehlenhoff: videoscalers: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/314699 [16:34:17] (03CR) 10Elukey: "We could also think about removing all the perl packages, Erik does not recognize anyone of them." [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) (owner: 10Elukey) [16:34:28] I'm not seeing anything... [16:35:49] (03CR) 10Muehlenhoff: [C: 032] videoscalers: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/314699 (owner: 10Muehlenhoff) [16:37:21] 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1846901 (10Andrew) When I tried to use role::puppetmaster::standalone last week I was unable to start apache on the new pup... [16:40:48] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2703992 (10Joe) I might start using them in a near future, in that case, I'll reimage them. [16:43:04] Hmm seems it was fixed?: https://phabricator.wikimedia.org/T139112 [16:43:12] Well, that was for something different [16:43:38] Sort of similar to what I am seeing [16:44:30] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704003 (10jcrespo) No it is not, it is not available on labs- and it should not be until this is resolved. [16:45:59] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Investigate ways to deploy docker to production - https://phabricator.wikimedia.org/T147402#2704008 (10Joe) So, after extended trial-and-error, the state of my work is as follows: - I am backporting the docker.io package for... 
[16:46:35] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704009 (10jcrespo) @Marostegui we should do this tomorrow, with special guest @chasemp , if he wants. [16:48:37] (03PS1) 10Dereckson: Enable signature button in Portal: on de.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315119 (https://phabricator.wikimedia.org/T145619) [16:48:52] James_F: mafk: I suggest this change for de.wikipedia signature button ^ [16:49:30] That will mostly avoid the antifeature pattern, no possibility to add a signature in (main) [16:49:51] Didn't they want for all namespaces? [16:50:04] which is a bit of ah, emh... weird? [16:50:16] because signature on Module: does not make much sense [16:50:20] for example [16:50:59] to be fair they arent as big as a wiki as enwiki so they can easily make such changes :P [16:50:59] https://phabricator.wikimedia.org/T145619#2658326 [16:51:18] Zppix: consider fr and de as big wikis too [16:51:58] correct, but if someone tried changing such on enwiki all hell would freeze over because of its size de and fr not so much [16:52:29] mafk: to enable it on Portal = 23 pro, 2 contra, https://de.wikipedia.org/wiki/Wikipedia:Umfragen/Unterschriften-Icon_in_Bearbeiten-Werkzeugleiste#Ich_bin_f.C3.BCr_die_Wiederherstellung_des_Unterschriften-Icons_der_Werkzeugleiste_im_Portalnamensraum [16:52:52] Zppix: en.wikipedia did wilder proposals like a new level of protection [16:53:25] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:54:29] Dereckson actually that was office from my understanding [16:55:52] Superprotect was [16:56:00] Other levels have been provided by other people [16:56:03] 30/50? 
[16:56:04] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [16:56:22] (03PS1) 10MarcoAurelio: Stop adding "Category:Uploaded with UploadWizard" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) [16:56:36] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:56:47] yep, extendedconfirmed [16:56:58] next will be extended extendedconfirmed [16:57:35] mafk actually ext confirm i've never really seen used ik its used on like 2-3 articles though [16:59:02] (03PS1) 10Elukey: Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 [16:59:14] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161010T1700). Please do the needful. [17:00:07] (03CR) 10jenkins-bot: [V: 04-1] Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 (owner: 10Elukey) [17:00:21] nothing scheduled for WDQS today... 
[17:00:34] (03CR) 10MarcoAurelio: "Not sure if just leaving it empty or removing it entirely in light of (03PS2) 10Elukey: Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 [17:00:42] the bot doesnt agree xD it wants you to work xD [17:01:46] (03CR) 10jenkins-bot: [V: 04-1] Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 (owner: 10Elukey) [17:01:48] (03PS3) 10Elukey: Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 [17:02:48] (03CR) 10jenkins-bot: [V: 04-1] Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 (owner: 10Elukey) [17:04:27] (03PS4) 10Elukey: Add documentation to systemd::syslog and an optional filepath variable [puppet] - 10https://gerrit.wikimedia.org/r/315122 [17:04:33] Jenkins I am stopping after this one don't worry [17:04:42] you have already made me cry too many times [17:09:25] (03CR) 10MarcoAurelio: Stop adding "Category:Uploaded with UploadWizard" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: 10MarcoAurelio) [17:18:36] how crazy would it be to create a wmf-mariadb package without a systemd unit, assuming that it always going to be provided by puppet? 
[17:19:47] there are many thing that systemd takes away from mysqld_safe configuration, and it is likely going to change from service to service and with time [17:20:49] !log upgraded maps1* to postgis 2.3.0 - T144763 [17:20:51] T144763: Upgrade PostGIS on maps servers to version 2.2+ - https://phabricator.wikimedia.org/T144763 [17:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:05] jynus: sounds good for an internal package, just make sure not to provide a init.d script systemd can fallback to [17:22:20] no, we already had one [17:22:32] I just have to stop adding it [17:22:45] or keep it on "share/doc" [17:26:19] <_joe_> meh, no luck in compiling docker today. [17:26:28] <_joe_> see you tomorrow! [17:33:05] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [17:35:56] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:44:16] (03CR) 10Ottomata: "I think we should get rid of this class, and move pigz package to statistics::cruncher" [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) (owner: 10Elukey) [17:46:36] (03PS1) 10Thcipriani: Scap: modify deploy-local arguments [puppet] - 10https://gerrit.wikimedia.org/r/315139 (https://phabricator.wikimedia.org/T146602) [17:47:38] (03CR) 10jenkins-bot: [V: 04-1] Scap: modify deploy-local arguments [puppet] - 10https://gerrit.wikimedia.org/r/315139 (https://phabricator.wikimedia.org/T146602) (owner: 10Thcipriani) [17:48:55] (03CR) 10Muehlenhoff: "Or add "pigz" to standard_packages, it's a tiny tool without any additional deps outside of the Debian base set and seems to be generally " [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) (owner: 10Elukey) [17:49:24] (03PS5) 10Krinkle: static.php should use 
deployed branch for invalid hashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312254 (https://phabricator.wikimedia.org/T146363) (owner: 10Brion VIBBER) [17:52:25] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:53:08] (03CR) 10Volans: [C: 031] "LGTM, thanks for taking care of this! It should also slightly reduce the overload of neon." [puppet] - 10https://gerrit.wikimedia.org/r/315107 (owner: 10Filippo Giunchedi) [17:53:43] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/315103 (owner: 10Filippo Giunchedi) [17:58:53] jouncebot: next [17:58:53] In 0 hour(s) and 1 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161010T1800) [17:59:54] Hi MatmaRex. [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161010T1800). Please do the needful. [18:00:05] MatmaRex: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [18:00:38] I can SWAT. [18:01:04] falsy across languages is always fun [18:01:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [18:01:30] MatmaRex: I'm going to look at this before deploy ^ [18:01:41] I added a patch too [18:01:53] Amir1: acked [18:01:54] jouncebot: bad robot! [18:02:16] Amir1: patches added some minutes before the window aren't announced by jouncebot [18:02:48] fatal doesn't show anything unusual, 144 proc line: 2959: warning: points must have either 4 or 2 values per line [18:03:00] https://phabricator.wikimedia.org/T138036 [18:03:21] Okay we can deploy. 
[18:03:32] yeah, I think jouncebot should recheck that too (or I do it a little bit sooner :D) [18:04:41] i think you can say "jouncebot: refresh" to make it recheck the page [18:04:46] Amir1: when do you plan to run the CleanDuplicateScores script ? [18:04:53] jouncebot: refresh [18:04:55] jouncebot: now [18:04:56] I refreshed my knowledge about deployments. [18:04:56] For the next 0 hour(s) and 55 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161010T1800) [18:05:13] Dereckson: the services deployment window (in one hour) [18:05:42] * Dereckson nods. [18:05:43] MatmaRex: nice, thanks [18:06:29] Amir1: CR +2'ed them, we're waiting for Zuul [18:06:44] Dereckson: thanks [18:06:45] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:07:50] MatmaRex: live on mw1099 [18:08:14] Dereckson: thanks, let me try on testwiki [18:08:28] Operations, Mail, OTRS, Wiki-Loves-Monuments: E-mails not being received by OTRS - https://phabricator.wikimedia.org/T145293#2704101 (Ciell) Is some-one still working on this? [18:08:29] (testwiki also requires X-Wikimedia-Debug) [18:09:12] Amir1: Jenkins jobs ok for ORES change, 4 remaining for core [18:09:42] the ores one does need to be tested [18:09:47] can't be tested [18:09:57] but the core one can be tested in mw1099 [18:11:11] we're waiting for https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/8516/ [18:11:22] Dereckson: are you sure it's live? it doesn't seem to be fixed on mw1099 [18:12:38] yes, it is [18:12:48] i still see the old code. grumble [18:12:56] add ?debug=true [18:13:06] oh and test. isn't enough, you MUST use X-Wikimedia-Debug too [18:13:14] yes, i have the extension [18:13:34] it's supposed to work without debug=true.
testing with it is cheating [18:13:41] by the way checking the file on mw1099, I noticed an incoherecne: mullie condition is warnings.length > 0, if a little above we've a warning.length [18:13:51] if ( warnings.length ) { // One of the DetailsWidgets has warnings ... } [18:14:33] it works the same. maybe it's a bit messy, i didn't notice. [18:15:49] Dereckson: eh, alright, i see the right version without debug=true now too. i guess it was cached somewhere [18:17:29] I think the extension should add that debug too, maybe header? [18:18:13] Amir1: live on mw1099 [18:18:23] the core patch? [18:18:25] both [18:18:35] and I acked script fix can't be tested [18:18:48] okay [18:18:55] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:19:09] ah debug=true and X-Wikimedia-Debug aren't identical [18:19:29] the second is to direct the query to a specific server or request logging/profile [18:19:47] yeah, I know I think the extension should do both [18:20:02] if debug=true could be somewhat added to the header [18:20:09] the first is to instruct resource loader not to do some caching, minimizing and packing operations [18:20:12] https://www.mediawiki.org/wiki/ResourceLoader/Features#Debug_mode [18:20:42] you probably want to know how your JS code behaves in cached mode [18:20:44] not in debug mode [18:20:48] Dereckson: yup it's fixed [18:20:53] https://www.wikidata.org/wiki/Q7251?action=info&debug= [18:20:58] oh a cookie allows it [18:21:18] (links in "Wikis subscribed to this entity" are okay) [18:21:20] (to add an header would be useful too) [18:22:06] !log dereckson@tin Started scap: file php-1.28.0-wmf.21/includes/Linker.php Do not normalise external links to special pages (T147685) [18:22:07] T147685: Interwiki links doesn't work in entity subscription in repo - https://phabricator.wikimedia.org/T147685 [18:22:09] !log dereckson@tin scap aborted: file php-1.28.0-wmf.21/includes/Linker.php Do not 
normalise external links to special pages (T147685) (duration: 00m 03s) [18:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:22] !log dereckson@tin Synchronized php-1.28.0-wmf.21/includes/Linker.php: Do not normalise external links to special pages (T147685) (duration: 01m 07s) [18:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:23:33] one file is enough :) [18:23:40] (03CR) 10Ottomata: "+1 to standard_packages" [puppet] - 10https://gerrit.wikimedia.org/r/315101 (https://phabricator.wikimedia.org/T145606) (owner: 10Elukey) [18:23:53] (03CR) 10Zhuyifei1999: "This is CommonSettings where changes affect all wikis with UW enabled, when consensus is established for Commons only; sure? (I haven't ch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: 10MarcoAurelio) [18:25:17] !log dereckson@tin Synchronized php-1.28.0-wmf.21/extensions/ORES/maintenance/CleanDuplicateScores.php: Fixup maintenance/CleanDuplicateScores.php (duration: 00m 54s) [18:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:30] Amir1: core and ORES changes synced to prod [18:25:49] the core one works like charm [18:25:51] thank you [18:25:55] amazing like always [18:26:00] jdlrobson: Undefined index: mobile-license in /srv/mediawiki/php-1.28.0-wmf.21/extensions/MobileFrontend/includes/skins/MinervaTemplate.php on line 99 / Undefined index: footer-site-heading-html in /srv/mediawiki/php-1.28.0-wmf.21/extensions/MobileFrontend/includes/skins/MinervaTemplate.php on line 98 [18:26:35] You're welcome. [18:27:10] anything i can help with (im actually getting quite bored atm :P) [18:27:52] Zppix: what are your interests? [18:28:18] MatmaRex: works fine? 
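[editor's note] The distinction drawn above between the two debug switches (X-Wikimedia-Debug routes the request to a specific server, debug=true tells ResourceLoader to skip caching, minifying and packing) can be sketched as follows. This is a minimal illustration only: the helper name build_debug_request and the "backend=<host>" header value format are assumptions made for the example, not something stated in the log.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_debug_request(url, backend=None, rl_debug=False):
    """Return (url, headers) for a test request.

    backend  -- host to route to via X-Wikimedia-Debug; the
                "backend=<host>" value format is an assumption.
    rl_debug -- append debug=true so ResourceLoader serves
                uncached, unminified modules.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if rl_debug:
        # Replace any existing debug parameter rather than duplicating it.
        query = [(k, v) for k, v in query if k != "debug"]
        query.append(("debug", "true"))
    headers = {}
    if backend:
        headers["X-Wikimedia-Debug"] = "backend=" + backend
    return urlunsplit(parts._replace(query=urlencode(query))), headers
```

Keeping the two switches as separate arguments mirrors the point made in the channel: a fix should normally be checked against the debug server without debug=true, so that the cached, minified code path is the one being exercised.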
[18:28:24] Dereckson: yeah [18:28:33] PHP, python, and QA, and other bug smashing things basically anything mediawiki related xD [18:28:40] MatmaRex: syncing [18:29:26] !log dereckson@tin Synchronized php-1.28.0-wmf.21/extensions/UploadWizard/resources/controller/uw.controller.Details.js: Don't show warning confirmation dialog when there are no warnings (T147659) (duration: 00m 48s) [18:29:28] T147659: Warning dialog ("We recommend that you properly fill in all the fields.") is displayed even if there are no warnings in the form - https://phabricator.wikimedia.org/T147659 [18:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:30] Zppix: some communities site requests blocked by development: https://phabricator.wikimedia.org/T119827 https://phabricator.wikimedia.org/T42459 https://phabricator.wikimedia.org/T24097 (the first two in PHP, the last PHP or Python with a fun idea to use https://en.wikipedia.org/wiki/Normal_distribution to estimate dates) [18:37:30] 06Operations, 10Mail, 10OTRS, 10Wiki-Loves-Monuments: E-mails not being received by OTRS - https://phabricator.wikimedia.org/T145293#2704147 (10Platonides) I don't think so. Who do you think should act on this, given the above information? [18:37:42] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [18:40:21] (03CR) 10Gilles: Upgrade to 0.1.25 (031 comment) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/315088 (owner: 10Gilles) [18:41:00] for graphite1001 esmas warning, it's still 534 proc line: 2959: warning: points must have either 4 or 2 values per line [18:41:06] Where are they wanting T119827 to be located Dereckson? [18:41:09] T119827: protectionLevels says that the current title is protected if it does not exist - https://phabricator.wikimedia.org/T119827 [18:41:14] no sorry [18:41:16] (03CR) 10Gilles: [C: 04-1] "Thanks to Filippo TIL that /var/run lives in tmpfs. 
We really can't have all those files live in memory. Instead I will try to solve the s" [puppet] - https://gerrit.wikimedia.org/r/315062 (owner: Gilles) [18:41:17] t424259 [18:50:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:53:41] Amir1: 2 Undefined index: width in /srv/mediawiki/php-1.28.0-wmf.21/includes/Linker.php on line 752 / 2 Undefined index: width in /srv/mediawiki/php-1.28.0-wmf.21/includes/Linker.php on line 750 [18:54:10] I don't think that's related [18:54:40] https://gerrit.wikimedia.org/r/#/c/315142/1/includes/Linker.php [18:54:53] it only touches line 320 [18:56:14] You call $target->isExternal and the notice occurs in Linker::processResponsiveImages, yes unrelated. [18:58:25] Amir1: an unrelated and already known bug, https://phabricator.wikimedia.org/T138987 [18:58:47] oh thanks [19:03:55] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:05:11] (CR) Krinkle: [C: 1] Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: Gilles) [19:11:09] Krenair: gilles: there is an old similar request from he.wikipedia [19:19:46] Operations, DBA, Labs, Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704247 (Marostegui) Sounds good to me, let's do it tomorrow! On 10 Oct 2016 18:46, "jcrespo" wrote:... [19:20:38] Dereckson, ?
[19:21:08] tab error, I wanted to say Krinkle [19:40:11] (PS8) Hashar: zuul: refactor to use hiera [puppet] - https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) [19:40:13] (PS3) Hashar: zuul: migrate server only settings out of merger [puppet] - https://gerrit.wikimedia.org/r/309299 [19:40:15] (PS1) Hashar: contint: install jenkins+CI site on contint1001 [puppet] - https://gerrit.wikimedia.org/r/315146 [19:45:16] (CR) Hashar: "https://puppet-compiler.wmflabs.org/4263/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/315146 (owner: Hashar) [19:55:13] (PS3) Paladox: phabricator: Reduce innodb_ft_min_token_size from 3 to 1 [puppet] - https://gerrit.wikimedia.org/r/315057 (https://phabricator.wikimedia.org/T146673) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161010T2000). Please do the needful. [20:00:25] We have a deployment of ORES and a maintenance script to run [20:00:34] both are straightforward [20:02:24] Amir1: like now ? [20:02:38] Amir1: there is no one from ops around since the US has a day off [20:02:57] It was a services window so I thought people would be around [20:03:14] if Ops are not around it can wait for a day [20:03:20] I'm here but it's well outside my tz [20:03:20] depends on the change, but you are more or less on your own :D [20:03:25] (11 pm) sorry.... [20:03:36] we should probably have cleared out the slots for that days [20:03:37] day [20:03:41] tomorrow everyone will be back on schedule [20:03:51] yeah that would have been good [20:04:10] okay, it's not that important, we are just shutting up ORES: https://phabricator.wikimedia.org/T146680 [20:04:12] Dereckson: around ? [20:04:18] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:04:25] oh is this the spammy logs from INFO level logging? [20:04:27] * apergos looks [20:04:44] it was in warning level but we are moving it to error level [20:04:55] Amir1: ahhh [20:05:01] Amir1: https://gerrit.wikimedia.org/r/#/c/314843/1/logging_config.yaml ? :] [20:05:02] I did moving from info to warn several days ago [20:05:11] hashar: yeah, that :D [20:05:15] right [20:05:17] hashar: yes, I'm [20:05:21] Amir1: go for it :] [20:05:34] I'm willing to be around for that one for the next little bit [20:05:35] Dereckson: there will probably nobody available at the evening swat. it is a day of in the US today [20:05:39] But I also want to run a maintenance script, for that I will do it tomorrow [20:05:43] hashar: thanks! [20:05:50] apergos: awesome [20:05:58] the maintenance script, not so much though [20:06:08] Amir1: make sure that whatever you deploy is actually just that code [20:06:13] exactly [20:06:15] Amir1: in case the /deploy repo got updated in between :] [20:06:24] yeah [20:06:37] I keep a strict eye on deploy repo [20:06:47] Dereckson: but I guess you can self deploy cant you? [20:06:50] and since I usually deploy, I know the last commit by heart [20:07:02] check it just in case :-P [20:07:07] Dereckson: if you want we can do it right now, instead of late in the night :] [20:07:24] definitely [20:07:31] hashar: as you like [20:07:32] Amir1: you have my blessing :] [20:07:59] Thank you [20:08:02] it's a trivial config change easy to test [20:08:02] :) [20:08:08] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:08:48] yup, that's the only commit in between [20:08:55] (I'm in tin, deploying) [20:09:03] okay [20:09:19] Dereckson: lets do it now [20:09:35] (03CR) 10Hashar: [C: 031] "This is all fine. 
We can get it deployed now :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [20:09:38] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:09:44] !log deploying 8bbd3ab to ores canary nodes (T146680) [20:09:45] T146680: Quiet result.get Warning in tasks - https://phabricator.wikimedia.org/T146680 [20:09:47] Dereckson: same please proceed if you want :] [20:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:57] k [20:09:59] can Dereckson wait 10 minutes? [20:10:02] sure [20:10:08] just so that the ores change goes through and is verified to be okay [20:10:11] awesome [20:10:12] apergos: ping me when you're done [20:10:18] yep [20:10:52] what i like with ops is that they are always way more careful than I am :] [20:11:03] (03CR) 10MarcoAurelio: "> This is CommonSettings where changes affect all wikis with UW" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: 10MarcoAurelio) [20:11:05] is that going to result in a restart of each ores service btw? [20:11:16] apergos: yes [20:11:19] hashar: it will be my *ss in a sling (or my lack of sleep) otherwise :-P [20:11:23] ah hm ok [20:11:27] well we'll see how that goes [20:11:44] deployed in canaries [20:12:05] how many of those are there, and how do the logs look? [20:12:44] scb1002 is the canary [20:13:02] oh there's only sbc1001 and 2, right? [20:13:04] scb1001, scb2001 and scb2002 are not [20:13:09] oh right codfw [20:13:18] are the codfw ones getting real traffic? 
[20:13:19] but they don't get traffic [20:13:24] heh there ya go [20:13:47] we had one of codfw as canary [20:14:00] but after one incident we decided to use one of real nodes [20:14:06] but if it doesn't see real traffic it's not as useful [20:14:12] good move [20:14:12] in case something breaks, half of traffic returns error [20:14:36] https://grafana.wikimedia.org/dashboard/db/ores-extension [20:14:37] so how are those logs (and the service) looking? [20:14:39] https://grafana.wikimedia.org/dashboard/db/ores [20:15:05] when it was in codfw, we manually logged in and tested internally but wasn't super useful [20:15:43] yep [20:15:44] I'm monitoring grafana jobs and failure ratio, error ratio, etc. for the past ten minutes [20:15:59] great [20:16:12] everything looks okay [20:16:15] going for all [20:16:18] +1 [20:16:19] :) [20:16:41] !log deploying 8bbd3ab to all ores nodes (T146680) [20:16:42] T146680: Quiet result.get Warning in tasks - https://phabricator.wikimedia.org/T146680 [20:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:48] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:19:00] waiting for them to get back up to speed (or at least show it on the graphs) [20:20:11] restarting the services [20:22:34] deployment is done [20:22:40] everything looks ok [20:22:48] all right, let's give it another 10 mins and if [20:22:58] nothing's exploded by then we'll call it good [20:23:12] logs still look good (less in em)? [20:23:31] apergos: looks good to me [20:23:39] (03CR) 10Bartosz Dziewoński: "A few other wikis use UploadWizard (testwiki, and I think rowiki or something?). 
Commons-specific customizations of UploadWizard config ar" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: 10MarcoAurelio) [20:23:52] apergos: they are okay [20:24:09] all right. I'm watching the graphs for a bit [20:25:10] thanks [20:26:51] (03PS2) 10Thcipriani: Scap: modify deploy-local arguments [puppet] - 10https://gerrit.wikimedia.org/r/315139 (https://phabricator.wikimedia.org/T146602) [20:26:52] Dereckson: lets do the commons bureaucrat change :) [20:27:57] (03CR) 10jenkins-bot: [V: 04-1] Scap: modify deploy-local arguments [puppet] - 10https://gerrit.wikimedia.org/r/315139 (https://phabricator.wikimedia.org/T146602) (owner: 10Thcipriani) [20:29:10] accountcreator? [20:30:36] mafk: yes [20:30:41] k [20:31:07] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:31:22] (03PS2) 10Dereckson: Allow Commons 'crats to manage accountcreator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [20:31:44] mafk: odder is away this week [20:31:47] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:32:32] (03CR) 10Dereckson: [C: 032] Allow Commons 'crats to manage accountcreator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [20:33:04] (03Merged) 10jenkins-bot: Allow Commons 'crats to manage accountcreator group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309912 (https://phabricator.wikimedia.org/T144689) (owner: 10Odder) [20:33:11] apergos: may I proceed? [20:33:17] the only thing I don't like is the tiny little spike in response time [20:33:23] yes [20:33:36] hashar: you are babysitting this one, right? 
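[editor's note] The rollout order followed above (deploy to the canary, watch the dashboards for about ten minutes, then "going for all") can be sketched roughly as below. This is an illustration under stated assumptions, not the actual deployment tooling: staged_rollout, deploy and healthy are hypothetical names, and healthy stands in for the manual check of the Grafana failure and error ratios.

```python
def staged_rollout(canaries, others, deploy, healthy):
    """Deploy to canary hosts first, then to the rest only if they look fine.

    deploy(host)  -- hypothetical callable that pushes the change to one host.
    healthy(host) -- hypothetical callable standing in for the manual
                     dashboard check; returns True if the host looks OK.
    Returns the list of hosts that received the change.
    """
    done = []
    for host in canaries:
        deploy(host)
        done.append(host)
    # Stop here if any canary looks bad, leaving the other nodes untouched.
    if not all(healthy(host) for host in canaries):
        return done
    for host in others:
        deploy(host)
        done.append(host)
    return done
```

With the hosts named in the log this would be called as staged_rollout(["scb1002"], ["scb1001", "scb2001", "scb2002"], deploy, healthy). As noted in the channel, the canary is only informative because scb1002 serves real traffic.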
[20:34:36] 309912 is live on mw1099 [20:34:51] apergos: yeah [20:34:54] apergos: get some sleep :] [20:35:00] Works. [20:35:02] apergos: really Dereckson can handle those changes all fine [20:35:11] will just stay around in case something weird happens :D [20:35:26] good, you never know, hhvm could go all weird or something [20:35:45] Dereckson: didn't know, sorry for slow reply, I'm doing a crossword [20:35:58] thanks. I'm gonna drift off then, it's early but I seem to have got the cold that's going around so hoping to kick it out the door by getting good sleep [20:36:05] Good night. [20:36:10] apergos Have a good one [20:36:20] apergos: thank you for the ORES babysitting :] And kudos Amir1 ! [20:36:31] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Allow Commons 'crats to manage accountcreator group (T144689) (duration: 00m 50s) [20:36:31] T144689: Allow Commons bureaucrats to manage members of the accountcreator user group - https://phabricator.wikimedia.org/T144689 [20:36:33] hashar: thank you! [20:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:36] see yas, happy to chip in my .02€ [20:36:47] apergos: thanks for keeping an eye on ORES [20:37:13] Amir1: sure. just keep an eye on the response time, I don't know why that's not dropping back down a bit [20:37:20] everything else looks pretty boring :-) [20:37:31] the busy queue [20:37:32] ? [20:37:34] Done. Works fine in prod too.
[20:37:47] Dereckson: \o/ [20:38:48] !log Dropped oauth tables from wikitech [20:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:59] response time (change prop) [20:39:22] !log Created up to date oauth tables on wikitech [20:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:41] (PS3) Thcipriani: Scap: modify deploy-local arguments [puppet] - https://gerrit.wikimedia.org/r/315139 (https://phabricator.wikimedia.org/T146602) [20:41:04] Reedy: maybe add oauth for gerrit too? [20:42:20] mafk: Not my responsibility :) [20:44:06] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:44:17] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:46:02] it is bed time for me *wave* [20:46:24] good night hashar [20:46:31] night! [20:46:46] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:46:47] night hashar [20:46:57] response time now looking pretty normal. so zzzzzz.... :-) [20:47:25] Operations, Continuous-Integration-Infrastructure, Traffic, Regression: Favicon broken on doc.wikimedia.org and integration.wikimedia.org (HTTP 500) - https://phabricator.wikimedia.org/T147814#2704427 (Krinkle) p:Triage>Normal Requesting from Varnish with `--compressed` includes it, reque... [20:47:29] (PS1) Reedy: Re-enable OAuth on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/315185 (https://phabricator.wikimedia.org/T147804) [20:51:06] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [20:53:42] Operations, Discovery, Maps, Interactive-Sprint, Patch-For-Review: Unmet dependencies around postgis apt packages on maps* servers - https://phabricator.wikimedia.org/T147780#2704435 (Gehel) Open>Resolved [20:55:10] (PS4) Thcipriani: Beta: Clean puppetmaster cherry-picks [puppet] - https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [20:55:58] (PS2) Reedy: Re-enable OAuth on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/315185 (https://phabricator.wikimedia.org/T147804) [20:57:53] (PS5) Thcipriani: Beta: Clean puppetmaster cherry-picks [puppet] - https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [20:58:45] (PS3) Gilles: Make thumbor use a temp folder controlled by systemd-tmpfiles instead of /tmp [puppet] - https://gerrit.wikimedia.org/r/315062 [20:59:01] (PS4) Gilles: Make thumbor use a temp folder controlled by systemd-tmpfiles instead of /tmp [puppet] - https://gerrit.wikimedia.org/r/315062 [20:59:43] (CR) Paladox: [C: 1] Re-enable OAuth on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/315185 (https://phabricator.wikimedia.org/T147804) (owner: Reedy) [21:00:04] dapatrick and bawolff: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161010T2100). Please do the needful. [21:00:06] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[21:02:44] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:06:46] (03CR) 10Reedy: [C: 032] Re-enable OAuth on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315185 (https://phabricator.wikimedia.org/T147804) (owner: 10Reedy) [21:07:10] (03Merged) 10jenkins-bot: Re-enable OAuth on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315185 (https://phabricator.wikimedia.org/T147804) (owner: 10Reedy) [21:08:49] !log reedy@tin Synchronized wmf-config/: Re-enable OAuth on Wikitech T147804 (duration: 00m 52s) [21:08:51] T147804: Install OAuth for wikitech - https://phabricator.wikimedia.org/T147804 [21:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:48] (03CR) 10Thcipriani: "@Filippo thanks for the review!" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [21:10:10] (03PS2) 10Reedy: Remove wikimania2013wiki specific translate config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314742 [21:10:22] (03CR) 10Reedy: [C: 032] Remove wikimania2013wiki specific translate config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314742 (owner: 10Reedy) [21:10:50] (03Merged) 10jenkins-bot: Remove wikimania2013wiki specific translate config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314742 (owner: 10Reedy) [21:11:00] (03PS2) 10Reedy: Remove spurious transcoding-labs.org usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314753 [21:11:08] (03CR) 10Reedy: [C: 032] Remove spurious transcoding-labs.org usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314753 (owner: 10Reedy) [21:11:34] (03Merged) 10jenkins-bot: Remove spurious transcoding-labs.org usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314753 (owner: 10Reedy) 
[21:12:40] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:12:47] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Remove some legacy cruft that is unused (duration: 00m 50s) [21:12:48] (03PS3) 10Reedy: wfLoadExtension for many more extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314746 (https://phabricator.wikimedia.org/T140852) [21:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:00] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:15:19] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [21:15:36] (03CR) 10Reedy: [C: 032] wfLoadExtension for many more extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314746 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [21:16:05] (03Merged) 10jenkins-bot: wfLoadExtension for many more extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314746 (https://phabricator.wikimedia.org/T140852) (owner: 10Reedy) [21:20:12] !log reedy@tin Synchronized wmf-config/CommonSettings.php: More wfLoadExtension, no config changes (duration: 00m 49s) [21:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:21:50] (03PS2) 10Reedy: Add upload_by_url right to Commons bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309911 (https://phabricator.wikimedia.org/T145010) (owner: 10Odder) [21:23:08] (03PS2) 10Reedy: Add 'message-format' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312404 (https://phabricator.wikimedia.org/T146416) (owner: 10Gergő Tisza) [21:23:13] (03CR) 10Reedy: [C: 032] Add 'message-format' log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312404 (https://phabricator.wikimedia.org/T146416) (owner: 10Gergő Tisza) [21:23:40] (03Merged) 10jenkins-bot: Add 'message-format' 
log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/312404 (https://phabricator.wikimedia.org/T146416) (owner: Gergő Tisza) [21:24:15] (Abandoned) Reedy: remove reference to undefined wmgMFUseCentralAuthToken [mediawiki-config] - https://gerrit.wikimedia.org/r/313346 (owner: 20after4) [21:24:47] (PS2) Reedy: Fix 'massmessage-sender' group for ur.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314851 (https://phabricator.wikimedia.org/T147743) (owner: MarcoAurelio) [21:24:51] (CR) Reedy: [C: 2] Fix 'massmessage-sender' group for ur.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314851 (https://phabricator.wikimedia.org/T147743) (owner: MarcoAurelio) [21:25:19] (Merged) jenkins-bot: Fix 'massmessage-sender' group for ur.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314851 (https://phabricator.wikimedia.org/T147743) (owner: MarcoAurelio) [21:26:42] (PS3) Reedy: Add upload_by_url right to Commons bots [mediawiki-config] - https://gerrit.wikimedia.org/r/309911 (https://phabricator.wikimedia.org/T145010) (owner: Odder) [21:26:46] (CR) Reedy: [C: 2] Add upload_by_url right to Commons bots [mediawiki-config] - https://gerrit.wikimedia.org/r/309911 (https://phabricator.wikimedia.org/T145010) (owner: Odder) [21:26:46] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Fix typo in group name. 
Add message-format logging group (duration: 00m 50s) [21:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:15] (Merged) jenkins-bot: Add upload_by_url right to Commons bots [mediawiki-config] - https://gerrit.wikimedia.org/r/309911 (https://phabricator.wikimedia.org/T145010) (owner: Odder) [21:29:04] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Add upload_by_url right to Commons bots (duration: 00m 50s) [21:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:29] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:45:04] Puppet, Beta-Cluster-Infrastructure: puppet failure on deployment-phab0[12] due to missing expected puppet:///modules/phabricator/sshd-phab.service - https://phabricator.wikimedia.org/T147818#2704482 (AlexMonk-WMF) [21:56:02] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:57:55] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161010T2300). [23:00:05] Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:06:18] Early deployed. [23:09:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:21:38] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures