[00:01:11] (CR) BryanDavis: Add a wiki configuration tag for configured language (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/319253 (https://phabricator.wikimedia.org/T149755) (owner: EBernhardson)
[00:02:12] Operations, Gerrit, grrrit-wm, Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2761897 (Paladox) @Legoktm what I have done to achive what your asking is I have added a new library https://github.com/forever...
[00:02:33] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[00:04:33] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[00:07:54] Operations, ops-eqiad, netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2761935 (faidon)
[00:21:53] PROBLEM - Disk space on labstore2001 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied
[00:22:23] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:30:18] (PS1) Dzahn: add AAAA and PTR for contint1001.wikimedia.org. [dns] - https://gerrit.wikimedia.org/r/319258
[00:30:30] (CR) jenkins-bot: [V: -1] add AAAA and PTR for contint1001.wikimedia.org. [dns] - https://gerrit.wikimedia.org/r/319258 (owner: Dzahn)
[00:31:35] (PS2) Dzahn: add AAAA and PTR for contint1001.wikimedia.org. [dns] - https://gerrit.wikimedia.org/r/319258
[00:31:39] Operations, ops-eqiad, netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2762011 (faidon)
[00:34:15] (PS3) Dzahn: add AAAA and PTR for contint1001.wikimedia.org. [dns] - https://gerrit.wikimedia.org/r/319258
[00:34:17] (PS2) Faidon Liambotis: netops (etc.): add asw2-d-eqiad [puppet] - https://gerrit.wikimedia.org/r/319098
[00:34:39] (CR) Dzahn: [C: 2] "contint1001 is not in production yet (but hopefully will be tomorrow). adding v6 support now" [dns] - https://gerrit.wikimedia.org/r/319258 (owner: Dzahn)
[00:34:44] (CR) Faidon Liambotis: [C: 2] netops (etc.): add asw2-d-eqiad [puppet] - https://gerrit.wikimedia.org/r/319098 (owner: Faidon Liambotis)
[00:46:33] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:51:21] Operations, ops-eqiad, netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2762043 (faidon) a:Cmjohnson>faidon So, the following things happened today (and not all as smoothly or necessarily in the order listed here, a lot of interm...
[00:51:26] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[00:53:38] (PS2) Dzahn: add AAAA and PTR for eventlog1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/319150
[00:54:45] (CR) Dzahn: [C: 2] add AAAA and PTR for eventlog1001.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/319150 (owner: Dzahn)
[00:56:38] (CR) Dzahn: "/srv/log/eventlogging/all-events.log still getting events like before.." [dns] - https://gerrit.wikimedia.org/r/319150 (owner: Dzahn)
[00:56:43] (CR) Dzahn: "/srv/log/eventlogging/all-events.log still getting events like before.." [puppet] - https://gerrit.wikimedia.org/r/317192 (owner: Dzahn)
[00:56:53] Operations, ops-eqiad, netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2762054 (faidon) a:faidon>Cmjohnson The following things are now pending from @Cmjohnson: - The link between D 2 and D 5 does not seem to be working — is see...
[01:11:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[01:12:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:12:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]
[01:12:16] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:13:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:13:36] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]
[01:14:36] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[01:17:36] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[01:18:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[01:18:06] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[01:19:16] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:20:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:21:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:23:54] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[01:24:04] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:24:54] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:26:34] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]
[01:27:04] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]
[01:30:36] (PS3) Dzahn: icinga: move files/icinga/ into module [puppet] - https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893)
[01:41:04] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[01:41:34] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[01:45:14] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2746
[01:48:04] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1809.984542 Seconds
[01:49:04] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 9.955776 Seconds
[01:50:14] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3046
[01:52:59] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[01:55:09] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 387917 Threads: 5 Questions: 32007313 Slow queries: 2767 Opens: 2006 Flush tables: 1 Open tables: 598 Queries per second avg: 82.510 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[02:00:59] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful
[02:03:16] (CR) Dzahn: icinga: move files/icinga/ into module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[02:03:47] (CR) Dzahn: "@Akosiaris yea, these 2 files also looked unused to me. I checked git log, and interestingly i found this:" [puppet] - https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[02:04:43] (CR) Dzahn: "AndrewBogott: Do you remember why you reverted this back then?" [puppet] - https://gerrit.wikimedia.org/r/165007 (owner: Andrew Bogott)
[02:05:14] (Abandoned) Alex Monk: puppet::self::gitclone: Get gitdir from puppetmaster::base_repo [puppet] - https://gerrit.wikimedia.org/r/310729 (owner: Alex Monk)
[02:06:11] (CR) Dzahn: "asking because of https://gerrit.wikimedia.org/r/#/c/318436/" [puppet] - https://gerrit.wikimedia.org/r/165007 (owner: Andrew Bogott)
[02:10:57] (PS2) Dzahn: osm: move files/osm/tuning.conf to role module [puppet] - https://gerrit.wikimedia.org/r/318453
[02:11:12] (CR) Dzahn: "done" [puppet] - https://gerrit.wikimedia.org/r/318453 (owner: Dzahn)
[02:12:56] (PS4) Dzahn: realm: add 'projectcom' to private wiki list [puppet] - https://gerrit.wikimedia.org/r/305095 (https://phabricator.wikimedia.org/T143138)
[02:16:49] (CR) Dzahn: "remove eventlog2001 per Moritz' comment. eventlog1001 now has IPv6 records." [puppet] - https://gerrit.wikimedia.org/r/316497 (owner: Dzahn)
[02:17:02] grrrit-wm: restart
[02:17:39] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:18:53] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.23) (duration: 05m 43s)
[02:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:39] Operations, Gerrit, grrrit-wm, Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2762126 (Paladox) @Legoktm ok restarting only the ssh connection is supported now. You will need to be in the whitelist to run...
[02:35:10] (PS8) Dzahn: zuul: migrate server only settings out of merger [puppet] - https://gerrit.wikimedia.org/r/309299 (owner: Hashar)
[02:35:12] (PS8) Dzahn: zuul: migrate server only settings out of merger [puppet] - https://gerrit.wikimedia.org/r/309299 (owner: Hashar)
[02:36:28] (CR) Dzahn: [C: 2] zuul: migrate server only settings out of merger [puppet] - https://gerrit.wikimedia.org/r/309299 (owner: Hashar)
[02:36:30] (CR) Dzahn: [C: 2] zuul: migrate server only settings out of merger [puppet] - https://gerrit.wikimedia.org/r/309299 (owner: Hashar)
[02:40:04] (CR) Dzahn: "zuul merge on scandium changed. no-op on gallium/contint1001 confirmed." [puppet] - https://gerrit.wikimedia.org/r/309299 (owner: Hashar)
[02:44:54] (PS3) Dzahn: contint: drop contint::packages [puppet] - https://gerrit.wikimedia.org/r/317988 (owner: Hashar)
[02:45:01] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 10m 43s)
[02:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:42] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[02:50:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Nov 2 02:50:27 UTC 2016 (duration 5m 26s)
[02:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:33] Heads up im doing some whitelist coding so if grrrit-wm disconnects(ill try to make it less often) thats why!
[03:00:18] Zppix: please use #wikimedia-bot-gerrit
[03:00:29] RECOVERY - Last backup of the others filesystem on labstore1001 is OK: OK - Last run for unit replicate-others was successful
[03:00:35] there is a copy there for testing
[03:00:40] ??
[03:00:42] Where
[03:01:13] grrrit-wm-test in that channel, yep
[03:01:25] same tool acct??
[03:01:37] yes
[03:01:40] afaict
[03:01:48] and work on top of https://gerrit.wikimedia.org/r/#/c/318976/9/src/relay.js
[03:02:06] it's better now, it doesnt need to kill the IRC connection anymore
[03:02:12] just the gerrit ssh connection
[03:02:22] or it should
[03:04:06] Ive been working on shell to prevent need for merge
[03:04:59] it's duplicate work though. he already amended the whitelist part as well. it's an array now
[03:06:05] mutante: im working in whitelist.js
[03:07:06] ok, i would suggest you also upload to gerrit and we merge more often, many small changes > live hacking and waiting for one change that does it all
[03:07:55] I cant push to gerrit from tool.lolrrit and i dont have access to pc (sshing over mobile)
[03:08:33] then push from your home computer?
[03:08:43] oh, right
[03:08:53] we talked about the https push thing before
[03:09:07] i remember now
[03:09:23] mutante: i can push from my pc
[03:09:27] But no access
[03:09:38] As its wifi is being stupid
[03:09:44] Zppix: gotcha
[03:10:07] well, just keep the whitelist.js file there and we can push it tomorrow or something
[03:10:44] mutante: we could merge it with the current gerrit bot change by paladox as its meant for that
[03:12:15] that's what i meant by "work on top of it"
[03:12:18] basically
[03:12:36] take his version and build on it
[03:13:13] I have been somewhat not all tge stuff hes uploaded was his weve been tag teaming
[03:13:54] yes, i know, several people have worked on it by now
[03:13:59] RECOVERY - Disk space on labstore2001 is OK: DISK OK
[03:23:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 669.79 seconds
[03:29:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.36 seconds
[03:43:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.83 seconds
[04:00:44] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful
[04:04:44] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code
[04:07:01] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2762226 (bearND) In addition to that I'd like to know what the relationship between this task and the Thu...
[05:35:08] (PS1) Dzahn: fix IPv6 reverse record for contint1001 [dns] - https://gerrit.wikimedia.org/r/319267
[05:37:41] (CR) Dzahn: [C: 2] fix IPv6 reverse record for contint1001 [dns] - https://gerrit.wikimedia.org/r/319267 (owner: Dzahn)
[05:38:58] (CR) Dzahn: "[radon:~] $ host contint1001.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/319267 (owner: Dzahn)
[05:40:54] (CR) Dzahn: "[radon:~] $ ping6 contint1001.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/319267 (owner: Dzahn)
[05:49:32] (PS1) Dzahn: aptrepo: fix typo in template that broke release uploads [puppet] - https://gerrit.wikimedia.org/r/319268
[05:57:12] (PS2) Dzahn: aptrepo: fix typo in template that broke release uploads [puppet] - https://gerrit.wikimedia.org/r/319268
[05:57:44] (CR) Dzahn: [C: 2] aptrepo: fix typo in template that broke release uploads [puppet] - https://gerrit.wikimedia.org/r/319268 (owner: Dzahn)
[05:58:16] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100%
[06:00:39] !log re-enable puppet on bromine after gerrit 319268
[06:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:55] I've got labstore2001
[06:01:01] silenced now
[06:01:08] thanks madhuvishy :)
[06:01:18] and i'm also out again, heh
[06:09:36] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf]
[06:30:51] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:38:31] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:40:31] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:59:52] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[07:00:29] Operations, ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670#2762261 (Marostegui) Hey @Papaul There yo go: ``` root@db2047:~# hpssacli ctrl all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E0DB0) Gen8 ServBP 12+2 at Port 1I, Box 1, O...
[07:02:22] PROBLEM - puppet last run on wtp1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:08:32] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:13:52] Operations, ops-codfw, DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2762273 (Marostegui) I am going to try to fix db2011 today. This server belongs to m2 shard. This is what I am going to do, in order to roll back if this box happens to fail. First, I am planning on stop...
[07:19:07] !log Stopping MySQL db2011 for maintenance - T149099
[07:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:15] T149099: db2011 disk media errors - https://phabricator.wikimedia.org/T149099
[07:22:15] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[07:22:25] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[07:23:15] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0
[07:23:25] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0
[07:28:05] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[07:30:25] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:50:09] (PS2) Giuseppe Lavagetto: thumbor: use restart:always instead of on-failure [puppet] - https://gerrit.wikimedia.org/r/318903
[08:02:27] (PS2) Muehlenhoff: zookeeper: Retrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/311138
[08:10:55] !log change-prop deploying a28f9ba
[08:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:16] (CR) Giuseppe Lavagetto: [C: 2] thumbor: use restart:always instead of on-failure [puppet] - https://gerrit.wikimedia.org/r/318903 (owner: Giuseppe Lavagetto)
[08:21:23] (CR) Alexandros Kosiaris: "The answer is in https://gerrit.wikimedia.org/r/#/c/164682/" [puppet] - https://gerrit.wikimedia.org/r/165007 (owner: Andrew Bogott)
[08:22:10] (CR) Alexandros Kosiaris: "yes, chasing down the reverts looks like apart from those 2 removes ones, it also removed a used check_mariadb.pl" [puppet] - https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[08:22:41] (CR) Alexandros Kosiaris: [C: 2] icinga: move files/icinga/ into module [puppet] - https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[08:22:45] (PS4) Alexandros Kosiaris: icinga: move files/icinga/ into module [puppet] - https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[08:22:49] (CR) Alexandros Kosiaris: [V: 2] icinga: move files/icinga/ into module [puppet] - https://gerrit.wikimedia.org/r/318436 (https://phabricator.wikimedia.org/T110893) (owner: Dzahn)
[08:23:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:24:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:24:58] Operations, ops-codfw, DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2762328 (Marostegui) @jcrespo yo ok if I delete all the content of `dbstore2002:/srv/sqldata` today and move the snapshot from dbstore2001, start dbstore2002 and once that is done, we can contin...
[08:30:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:32:57] !log restarted cassandra-metrics-collector on aqs100[456] for jvm upgrades
[08:32:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:56] !log scb10ox stopping puppet and CP for Cassandra restarts
[08:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:49] !log rolling restart of cassandra on restbase in eqiad to pick up new Java security updates
[08:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:59] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[08:38:09] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[08:38:49] PROBLEM - changeprop endpoints health on scb1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.29, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[08:38:49] PROBLEM - changeprop endpoints health on scb1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.153, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[08:40:56] <_joe_> uh?
[08:42:21] known ^
[08:42:26] <_joe_> ah
[08:42:36] <_joe_> I restarted changeprop on scb1001 already
[08:42:41] damn
[08:42:41] <_joe_> I shouldn't have?
[08:42:45] will stop it back
[08:42:52] <_joe_> ok sorry
[08:42:58] ok stopped
[08:43:09] <_joe_> mobrovac: I didn't see a !log so I assumed it crashed
[08:43:22] <_joe_> the CP wasn't obvious :P
[08:43:33] !log Stopping mysql dbstore2002 for maintenance - T149457
[08:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:39] T149457: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457
[08:44:22] _joe_: yeah logged it as scb stopped puppet and CP
[08:46:11] Operations, ops-codfw, DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2762366 (jcrespo) This is the command output: ``` $ hpssacli ctrl all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E0DB0) Gen8 ServBP 12+2 at Port 1I, Box 1, OK array A (S...
[08:46:49] Operations, ops-codfw, DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2762367 (jcrespo) Sorry, wrong ticket- I will remove my previous comment.
[08:51:42] (PS1) Elukey: Add new mc* servers to site.pp with role:spare [puppet] - https://gerrit.wikimedia.org/r/319278 (https://phabricator.wikimedia.org/T137345)
[08:54:11] Operations, ops-codfw, DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2762437 (Marostegui) The transfer is now happening between dbstore2001 and dbstore2002
[08:54:42] Operations, Analytics-Kanban, Traffic, Patch-For-Review: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2762438 (elukey) This is probably due to: https://github.com/openssl/openssl/issues/1799
[09:02:18] (PS1) Giuseppe Lavagetto: profile::docker::registry: allow overriding the swift password [puppet] - https://gerrit.wikimedia.org/r/319279
[09:02:19] Operations, Analytics-Kanban, Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2762540 (elukey)
[09:03:54] (CR) Alexandros Kosiaris: "I am a bit perplexed by this. Any estimation of how much complex structures do we need ? I 've looked into https://gerrit.wikimedia.org/r/" [puppet] - https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) (owner: Ottomata)
[09:08:06] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762558 (Volans) @MoritzMuehlenhoff: done :wink:
[09:14:17] (CR) Giuseppe Lavagetto: [C: 2] profile::docker::registry: allow overriding the swift password [puppet] - https://gerrit.wikimedia.org/r/319279 (owner: Giuseppe Lavagetto)
[09:14:46] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2758740 (jcrespo) A number of hosts have been decommissioned. As per new instructions, they are not removed from icinga (puppet) nor stopped, and role::spare doesn't do that either. > Leaving th...
[09:27:44] (CR) Jcrespo: "Submitted at 11:43" [puppet] - https://gerrit.wikimedia.org/r/318900 (https://phabricator.wikimedia.org/T149340) (owner: Alexandros Kosiaris)
[09:30:59] Operations, Traffic: varnish-backend: weekly cron restart for all clusters - https://phabricator.wikimedia.org/T149784#2762582 (ema)
[09:31:11] Operations, Traffic: varnish-backend: weekly cron restart for all clusters - https://phabricator.wikimedia.org/T149784#2762597 (ema) p:Triage>Normal
[09:41:23] Operations, Traffic: varnish-backend: weekly cron restart for all clusters - https://phabricator.wikimedia.org/T149784#2762622 (ema) a:ema
[09:42:50] Operations, ops-codfw, DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2762624 (Marostegui) The backup finished, and I was able to extract it, so proceeding now. Clearing the foreign config ``` root@db2011:~# megacli -CfgForeign -Scan -aALL There are 1 foreign configurati...
[09:43:34] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:49:30] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762648 (Volans) >>! In T149643#2762562, @jcrespo wrote: > Disabling notifications is the only way to not alert. @jcrespo: I know, and there are also other reasons why make sense to disable noti...
[09:52:00] (PS1) Ema: cache_: enable varnish-be weekly cron restart for all clusters [puppet] - https://gerrit.wikimedia.org/r/319284 (https://phabricator.wikimedia.org/T149784)
[09:53:18] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762654 (jcrespo) But they have it, see: ``` 2016-09-21 09:01:06 Marostegui This server is going to be decommissioned - T146265 ``` This is for db1019. It is just not shown on the link you share...
[09:54:20] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2758740 (Marostegui) >>! In T149643#2762654, @jcrespo wrote: > But they have it, see: > ``` > 2016-09-21 09:01:06 Marostegui This server is going to be decommissioned - T146265 > ``` > > This is...
[10:01:32] !grrrit-wm-restart
[10:01:32] re-connecting to gerrit
[10:02:25] !grrrit-wm-restart
[10:02:27] re-connecting to gerrit
[10:04:21] !grrrit-wm-restart
[10:04:38] re-connecting to gerrit
[10:04:39] reconnected to gerrit
[10:05:01] yay it works :)
[10:05:31] Operations, ChangeProp, MediaWiki-JobQueue, Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2762677 (mobrovac) I agree as well that we shouldn't complicate things for small wikis. But, correct me if I'm...
[10:05:37] (CR) Jcrespo: [C: -1] "actually, that is a bad name for the wiki- all wikis must contain the wik particle." [puppet] - https://gerrit.wikimedia.org/r/305095 (https://phabricator.wikimedia.org/T143138) (owner: Dzahn)
[10:05:39] (CR) Jcrespo: [C: -1] "actually, that is a bad name for the wiki- all wikis must contain the wik particle." [puppet] - https://gerrit.wikimedia.org/r/305095 (https://phabricator.wikimedia.org/T143138) (owner: Dzahn)
[10:06:07] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762679 (Volans) @jcrespo @Marostegui: `db1019` it's fine, it has a scheduled downtime with a related comment and you can see it directly from the link I put in the description that it has one....
[10:08:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]
[10:09:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]
[10:10:07] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[10:10:47] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[10:10:50] Operations, MediaWiki-Configuration, User-mobrovac, Wikimedia-Developer-Summit (2017): Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2762696 (mobrovac)
[10:11:37] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[10:13:56] Operations, Wikimedia-Site-requests, Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558051 (jcrespo) projectcom is a bad name for a wiki **database name**. If this is a wiki, it must contain the wik string e.g. (enwiktionary, enwiki, ptwikimed...
[10:17:26] Operations, ops-codfw, DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2762715 (Marostegui) 32:4 finished the rebuild correctly Starting 32:7 ``` root@db2011:~# megacli -PdReplaceMissing -PhysDrv[32:7] -array3 -row1 -a0 Adapter: 0: Missing PD at Array 3, Row 1 is replaced....
[10:18:45] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762717 (jcrespo) @Volans db1019 scheduled downtime will eventually expire (where the comments is), the comment I was referring to is not shown on the link.
[10:26:19] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2758740 (Joe) If you assign `role::spare` to a server and run puppet on the host and the icinga host, you should remove all alerts there. If that's not the case, it is a bug in puppet and we sho...
[10:32:44] Operations, DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762731 (jcrespo) @Joe Apparently it removes it, if it didn't, it would show mysql alertss. But it keeps the common ones, which we do not want to show (plus potentially any running process that c...
[10:33:19] !log rolling restart of cassandra on restbase in eqiad completed [10:33:29] ^ mobrovac [10:33:42] \o/ [10:35:05] !log scb starting back CP and re-enabling puppet [10:35:10] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [10:35:30] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [10:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:50] RECOVERY - changeprop endpoints health on scb1003 is OK: All endpoints are healthy [10:35:50] RECOVERY - changeprop endpoints health on scb1004 is OK: All endpoints are healthy [10:36:13] 06Operations, 10DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762748 (10Volans) >>! In T149643#2762717, @jcrespo wrote: > @Volans db1019 scheduled downtime will eventually expire (where the comments is), the comment I was referring to is not shown on the lin... [10:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:20] <_joe_> mobrovac: having to stop/restart CP when restbase is undergoing maintenance doesn't seem like a great option; we should maybe make CP able to stall rb jobs in that phase? 
[10:36:34] 06Operations, 10ops-codfw: Broken disk in labstore2001 - https://phabricator.wikimedia.org/T149567#2762750 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03Papaul [10:37:20] _joe_: this is more of a precaution due to cassandra restarts, but i agree that it's not ideal [10:37:36] _joe_: but stopping RB jobs in CP is equivalent more or less to stopping CP [10:37:42] for now, though [10:38:05] <_joe_> yeah it's just a whishlist item :P [10:38:08] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2762756 (10dcausse) The current jobqueue provides features that may not be available out of the box with Kafka: -... [10:40:28] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2762761 (10Naveenpf) >>! In T144508#2739847, @CRoslof wrote: > If @Naveenpf or #operations is still interested in pursuing this task's specific r... [10:45:04] (03PS1) 10Ema: site: add varnish_exporter to esams/eqiad maps/misc [puppet] - 10https://gerrit.wikimedia.org/r/319291 (https://phabricator.wikimedia.org/T147424) [10:46:24] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2762769 (10Marostegui) 32:7 finished fine. Starting 32:11 ``` root@db2011:~# megacli -PdReplaceMissing -PhysDrv[32:11] -array5 -row1 -a0 Adapter: 0: Missing PD at Array 5, Row 1 is replaced. Exit Code: 0... 
[10:51:31] !log installing mailman security update [10:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:08] PROBLEM - mailman_qrunner on fermium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner [10:55:08] RECOVERY - mailman_qrunner on fermium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [10:55:50] (03PS1) 10Ema: prometheus: extend Varnish targets generation to text/upload [puppet] - 10https://gerrit.wikimedia.org/r/319293 (https://phabricator.wikimedia.org/T147424) [10:55:58] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[mailman],Exec[debconf-communicate set mailman/site_languages],Exec[debconf-communicate set mailman/default_server_language] [10:56:54] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2762817 (10mobrovac) The eventbus system already has retries built in: if a job fails, it is retried a predefined number... [10:57:55] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2762821 (10mobrovac) [10:58:18] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2751310 (10mobrovac) [10:58:21] 06Operations, 10DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762823 (10jcrespo) >>! In T149643#2762748, @Volans wrote: >>>! In T149643#2762717, @jcrespo wrote: >> @Volans db1019 scheduled downtime will eventually expire (where the comments is), the comment... 
[10:59:58] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:02:58] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [11:05:03] (03PS1) 10Ema: site: add varnish_exporter to text/upload [puppet] - 10https://gerrit.wikimedia.org/r/319295 (https://phabricator.wikimedia.org/T147424) [11:07:57] 06Operations, 10DBA: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2762856 (10MoritzMuehlenhoff) >>! In T149643#2762731, @jcrespo wrote: > Anyway, I would like to put down completely a host until it has been formatted, *then* it can be added to role:spare and be u... [11:24:30] (03PS1) 10Jcrespo: mariadb: Depool db1052 and db1073 once extra en api load has gone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319297 [11:26:56] (03PS1) 10Jcrespo: mariadb: Remove references to db1042 for decommission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319298 [11:27:34] (03PS1) 10Alexandros Kosiaris: Revert "tendril: Supply a robots.txt disallow all robots" [puppet] - 10https://gerrit.wikimedia.org/r/319299 (https://phabricator.wikimedia.org/T149340) [11:27:35] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2762917 (10elukey) After a chat with @ema we decided to test a very basic use case, namely if a TCP RST from the client could cause th... 
[11:27:36] (03PS1) 10Alexandros Kosiaris: Revert "Revert "tendril: Supply a robots.txt disallow all robots"" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) [11:28:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "tendril: Supply a robots.txt disallow all robots" [puppet] - 10https://gerrit.wikimedia.org/r/319299 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [11:28:54] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [11:29:00] (03CR) 10Alexandros Kosiaris: "Reverted in https://gerrit.wikimedia.org/r/#/c/319299/" [puppet] - 10https://gerrit.wikimedia.org/r/318900 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [11:32:49] (03CR) 10BBlack: [C: 031] cache_: enable varnish-be weekly cron restart for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/319284 (https://phabricator.wikimedia.org/T149784) (owner: 10Ema) [11:35:19] (03CR) 10Jcrespo: "There is actualy meaningful actions I would like to do- I would like to disable automatic deployment of the latest version (which this cha" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [11:37:20] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2762957 (10dcausse) >> - spread over time some sanity checks > > Could you elaborate on this use case? In cirrus we hav... 
[11:37:56] (03PS3) 10Muehlenhoff: zookeeper: Retrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/311138 [11:42:32] mobrovac: --^ [11:42:57] should be a no-op but let's watch changeprop :) [11:43:06] kk [11:44:03] (03CR) 10Muehlenhoff: [C: 032] zookeeper: Retrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/311138 (owner: 10Muehlenhoff) [11:46:50] Is there reasons why python 3 isnt being used? [11:48:58] marostegui, did db2034 downtimes just expired? [11:49:25] looks like it expired [11:49:30] let me downtime it again [11:49:46] (03PS2) 10Ema: cache_: enable varnish-be weekly cron restart for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/319284 (https://phabricator.wikimedia.org/T149784) [11:49:54] (03CR) 10Ema: [C: 032 V: 032] cache_: enable varnish-be weekly cron restart for all clusters [puppet] - 10https://gerrit.wikimedia.org/r/319284 (https://phabricator.wikimedia.org/T149784) (owner: 10Ema) [11:49:57] wanted to check with you before doing it myself [11:50:13] downtime the whole thing [11:50:16] yeah [11:50:19] I just did [11:50:19] it will be reimaged [11:50:20] thanks [11:50:31] yes, I need to do it [11:50:52] not necesarily you, it is still on the pending bucket :-) [11:51:13] and it is not high priority [11:51:50] (03PS1) 10Jcrespo: tendril+dbtree: Explicitly disable automatic pulls from HEAD [puppet] - 10https://gerrit.wikimedia.org/r/319301 (https://phabricator.wikimedia.org/T149340) [11:54:00] (03CR) 10Jcrespo: "Now, becaue of this https://gerrit.wikimedia.org/r/319301 (still not final, only a proposal for now) the logic has to change, because othe" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [11:59:26] jouncebot: refresh [11:59:28] I refreshed my knowledge about deployments. 
[12:00:18] (03CR) 10Jcrespo: "I am thinking of creating a rewrite rule for the virtualhost redirecting outside of the repo dir, and deploy the robots.txt there." [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [12:05:52] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2763074 (10Zppix) [12:06:37] (03PS2) 10Muehlenhoff: hhvm::admin: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/304476 [12:08:17] 06Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T149728#2763079 (10MoritzMuehlenhoff) a:03Cmjohnson [12:08:32] 06Operations, 10ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149703#2763081 (10MoritzMuehlenhoff) a:03Papaul [12:09:13] moritzm: do you think that the auto-created tasks should also be assigned? ^^^ [12:09:20] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2763086 (10Zppix) Im also currently working on a whitelist code to let certain users add people to whitelist for admin commands a... [12:09:37] 06Operations, 06Performance-Team, 10Thumbor, 15User-Joe: Thumbor instances exit with exit code 0 even when crashing/failing - https://phabricator.wikimedia.org/T149560#2763088 (10Joe) [12:09:44] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2763089 (10Marostegui) The last disk finished fine and the RAID is now Optimal ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline... 
[12:10:01] 06Operations, 10DBA: Puppetize tendril web user creation - https://phabricator.wikimedia.org/T148955#2763092 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:10:06] AFAIK adding them to the right DC-Ops tag was supposed to be enough [12:11:32] !log Updated Wikidata's property suggester with data from Monday's json dump and applied the T132839 workarounds [12:11:32] (03CR) 10Marostegui: [C: 031] "I will use db1052 to clone db2034 once it is reimaged so this is good! :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319297 (owner: 10Jcrespo) [12:11:35] sjoerddebruin: ^ [12:11:37] volans: I think that's fine, they can be assigned by the clinic duty person [12:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:38] T132839: Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [12:12:08] (03CR) 10Marostegui: [C: 031] mariadb: Remove references to db1042 for decommission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319298 (owner: 10Jcrespo) [12:13:02] Question can two phab tasks be added to gerrit commit msgs [12:14:09] Zppix yes [12:14:15] Ok [12:14:18] There can be as much as you like [12:17:30] 06Operations, 10Traffic: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839#2117602 (10BBlack) @fgiunchedi - deny all access to them via upload.wm.o? [12:20:52] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2757787 (10hashar) Why are you seeking to restart the bot entirely and or adding a user facing command to manually restart it ?... [12:21:48] hashar: read all the phab task please? 
[12:32:03] !log reedy@tin Synchronized php-1.29.0-wmf.1/includes/EditPage.php: Fix regression from 1.28.0-wmf.23 T149473 (duration: 00m 47s) [12:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:09] T149473: Edit warnings have escaped HTML - https://phabricator.wikimedia.org/T149473 [12:38:34] hashar: what's the plan with swat today? [12:47:34] hoo: thanks, again. [12:55:42] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [12:56:42] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [12:59:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 635 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3076472 keys, up 2 days 4 hours - replication_delay is 635 [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T1300). Please do the needful. [13:00:05] yurik and MatmaRex: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:04:01] here [13:04:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3059331 keys, up 2 days 4 hours - replication_delay is 0 [13:04:06] hi. [13:04:17] hi [13:04:47] who's swatting? [13:06:53] looks like nobody is around for eu swat, except me [13:10:07] MatmaRex, yurik: I am new to swat, but I could do it, if you could help a bit :) [13:10:19] sure [13:10:36] I have done swat of config changes, not sure what is different for core and extensions [13:11:07] hashar: around for swat? 
[13:11:24] oh men DST [13:11:27] forgot about that [13:11:46] hashar: is swat locked to UTC? [13:11:54] yeah [13:12:11] ok, so it's 2-3pm our time for the next 6 months or so [13:12:30] hashar: want to do swat, or should I do it? [13:13:54] I have CR+2 all of them [13:14:19] hashar: ok, you are doing the swat then? [13:14:29] guess I will? [13:14:56] hashar: ok, good luck then, MatmaRex and yurik are here [13:15:09] (03CR) 10Alexandros Kosiaris: [C: 031] tendril+dbtree: Explicitly disable automatic pulls from HEAD [puppet] - 10https://gerrit.wikimedia.org/r/319301 (https://phabricator.wikimedia.org/T149340) (owner: 10Jcrespo) [13:16:02] yurik: have you managed to test your zero* changes on beta cluster? [13:16:18] will get both changes pushed on mw1099 whenever they are merged by CI [13:19:47] should have +2ed before the swat :D [13:21:39] hashar, I keep hearing people wanting to NOT +2 before the swat :) [13:21:48] i will test it now :) [13:22:06] beta cluster + zero are ... somewhat hard :( [13:23:33] :D [13:23:39] we will need to invest some time in setting it up correctly there... [13:23:44] possibly a lot of time :( [13:24:48] pulled on tin [13:25:08] yurik: pulled on mw1099 , they are ready for testing :) [13:25:58] MatmaRex: if around https://gerrit.wikimedia.org/r/#/c/319304/ gallery.css is now on mw1099 [13:26:05] but I am just going to push it [13:26:14] ohh boy, here i go again. hashar, it might take some time for me ... expecting 10-15 min [13:26:27] (maybe quicker, but who knows ) [13:26:48] yurik: no problems. Take your time :) [13:27:21] !log hashar@tin Synchronized php-1.29.0-wmf.1/resources/src/mediawiki/page/gallery.css: [13:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:31] MatmaRex: gallery.css updated ! [13:27:36] hashar: yeah, i'm here. 
verified on testwiki [13:27:46] \O/ [13:27:48] with mw1099 [13:27:55] thanks :) [13:28:59] (03CR) 10Alexandros Kosiaris: "not sure why this change depends on non-automatic pulls of the latest by git::clone (or being deployed by scap). Granted we could deploy t" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [13:32:49] hashar, seems to be good, go ahead [13:32:56] zeljkof: not exactly for "ok, so it's 2-3pm our time for the next 6 months or so" and "swat locked to UTC" ← deployment hours are locked to SF time. The current drift is because the EU and the US don't use the same DST change date, so that's only an issue for one week in the fall and two weeks in the spring. [13:33:14] 06Operations, 10Wikimedia-General-or-Unknown: Icinga has httpauth on (not accessible for public) - https://phabricator.wikimedia.org/T62112#2763301 (10akosiaris) [13:33:42] Dereckson: one more reason to hate DST [13:34:07] !log hashar@tin Started scap: ZeroBanner / ZeroPortal extensions.json fix [13:34:15] yurik: pushing [13:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:52] Dereckson: zeljkof maybe we should lock the EU Swat time to the main Europe timezone :-} [13:35:14] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [13:35:34] (03CR) 10Alexandros Kosiaris: [C: 032] "looks fine in https://puppet-compiler.wmflabs.org/4517/labsdb1006.eqiad.wmnet/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/318453 (owner: 10Dzahn) [13:35:41] (03PS3) 10Alexandros Kosiaris: osm: move files/osm/tuning.conf to role module [puppet] - 10https://gerrit.wikimedia.org/r/318453 (owner: 10Dzahn) [13:35:43] (03CR) 10Alexandros Kosiaris: [V: 032] osm: move files/osm/tuning.conf to role module [puppet] - 10https://gerrit.wikimedia.org/r/318453 (owner: 10Dzahn) [13:36:00] zeljkof: the EU investigated a proposal to use *summer* time all year some years ago (around 2012), but only Estonia and Belgium were interested; other countries are fine with the DST change (the good news is that none was interested in going back to winter time) [13:36:40] Dereckson: but wait, winter time is the correct one, summer time is wrong [13:37:04] Er... "right" is here defined as the sun being highest around noon? [13:37:24] but that also depends on where in Europe you live, I live in a place where actual noon is just a few minutes away from clock noon [13:37:54] 06Operations, 10ops-codfw: Broken disk in labstore2001 - https://phabricator.wikimedia.org/T149567#2763304 (10chasemp) >>! In T149567#2759936, @Papaul wrote: > Re-opening this task since disk in slot 5 on labstore2001 is bad. @papaul have you replaced the failed drive and if not can you please? I believe i... [13:37:55] yes, correct time is when the shadows are the shortest at noon sharp [13:39:47] zeljkof: I'm not sure that's really a priority. Time is more a convention than a natural cycle to follow. The original goal of DST was to consume less energy, especially in the evening, when energy consumption peaks. Estonia's argument was that the tourism sector would benefit from more visitors in late October/November if there is one more hour of light.
[13:40:35] TZ is wrong anyway http://www.slate.com/blogs/the_world_/2014/02/21/how_wrong_is_your_time_zone_map_shows_how_far_ahead_or_behind_the_world.html [13:40:44] that map shows the shift between the TZ based time and the solar time [13:40:53] (03CR) 10DCausse: [C: 031] elasticsearch: /etc/elasticsearch/scripts is not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/319048 (owner: 10Gehel) [13:40:54] actually, for us all here, what we really would benefit from is "all the planet uses UTC" [13:40:55] full size http://i0.wp.com/poisson.phc.unipi.it/~maggiolo/wp-content/uploads/2014/01/SolarTimeVsStandardTime.png [13:42:20] yurik: change is still syncing .. I ran a full scap sync :( [13:42:26] (03PS1) 10Bmansurov: MF Beta: Enable moving first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319319 (https://phabricator.wikimedia.org/T145216) [13:42:51] hashar, with l10 rebuild? [13:42:58] yeah [13:43:11] heh, might take some time ) [13:43:25] hello deployers, I've added https://gerrit.wikimedia.org/r/#/c/319319/ to the current queue. Do we have enough time to deploy the patch? [13:44:26] (03CR) 10Ottomata: "It is in the param documentation for the $streams parameter there. Check out this vars.yaml:" [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [13:44:39] hashar, could you move zerowiki to group 1 plz? [13:44:52] bmansurov: there is a full scap in progress, ping in 20 minutes [13:44:57] it should'nt be the one to crash first ) [13:45:02] Dereckson: ok thanks [13:45:17] bmansurov, is that a graph patch?? :) [13:45:22] yurik: nop [13:45:34] not outside of the scheduled train really [13:45:37] yurik: soon, i need to get some design input first [13:45:50] yurik: i left some questions for you the other day [13:45:56] hehe. bmansurov which topic? 
[13:46:01] yurik: i'll ping you again when I work on graph [13:46:11] yurik: about styling the sandbox page [13:46:20] i don't think i saw them [13:46:31] ok, i'll ping you on the interactive channel [13:46:44] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:47:00] yurik: canaries synced [13:47:27] * yurik wonders what cooked canaries taste like [13:51:00] (03CR) 10Faidon Liambotis: [C: 04-1] "A couple of small comments, otherwise good to go." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/319083 (owner: 10Ema) [13:51:12] hashar, its alive!!:) [13:51:25] half synced yeah [13:51:27] seems to be working. finally. Only took a month to fix it [13:52:01] the more interesting question is - will it work after group 1 [13:56:08] !log hashar@tin Finished scap: ZeroBanner / ZeroPortal extensions.json fix (duration: 22m 01s) [13:56:13] yurik: done [13:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:45] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2763356 (10ArielGlenn) Note that since the parallel GCs and CMS use pretty much the same algorithm when doing minor... [13:58:26] !log European SWAT complete [13:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:14] thanks hashar ! [14:01:03] 06Operations, 10Wikimedia-General-or-Unknown: Icinga has httpauth on (not accessible for public) - https://phabricator.wikimedia.org/T62112#661810 (10MoritzMuehlenhoff) Although we do have a security-support version of Icinga running now, I'm against opening up the access to the public. Icinga/Nagios have a no... 
[14:01:41] 06Operations, 10ops-codfw, 06DC-Ops, 07Wikimedia-Incident: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#2763378 (10chasemp) I'm somewhat confused on the status of labstore2001 honestly. > labstore2001: sudo megacli -AdpAllInfo -aAll | grep -i PERC > Product Name... [14:03:10] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:03:33] Dereckson: did I miss the train or is there time? [14:05:33] you did [14:05:58] ok, I'll reschedule my patch [14:06:09] jouncebot: now [14:06:09] No deployments scheduled for the next 3 hour(s) and 53 minute(s) [14:06:18] jouncebot: next [14:06:19] In 3 hour(s) and 53 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T1800) [14:06:42] jouncebot: now [14:06:42] No deployments scheduled for the next 3 hour(s) and 53 minute(s) [14:06:46] jouncebot: next [14:06:46] In 3 hour(s) and 53 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T1800) [14:06:53] jouncebot: refresh [14:06:56] I refreshed my knowledge about deployments. [14:07:08] jouncebot: next [14:07:08] In 3 hour(s) and 52 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T1800) [14:07:22] bmansurov: I can push your change now [14:07:30] oh cool, hashar [14:07:33] thanks [14:07:52] which one ? :D [14:08:19] hashar: https://gerrit.wikimedia.org/r/#/c/319319/ [14:08:36] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2763391 (10Anomie) >>! In T149408#2762677, @mobrovac wrote: > But, correct me if I'm wrong, but small wikis run the jobs... 
[14:09:14] (03CR) 10Hashar: [C: 032] MF Beta: Enable moving first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319319 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [14:09:22] ideally [14:09:32] those kind of changes should probably be landed by the team owning the service [14:09:34] at any time :D [14:09:52] (03Merged) 10jenkins-bot: MF Beta: Enable moving first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319319 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [14:10:09] i see, hashar, I'll check if anyone on my team can SWAT [14:10:37] pulled the change on mw1099 [14:10:46] reason to remove the //taskrefs history from the file tho? [14:10:58] bmansurov: phuedx / joakino are definitely interested in deploying :D [14:11:12] then there are not so many mobile related changes to push [14:11:35] hashar: i see the change mw1099, thanks for deploying [14:11:53] so all green for a full sync ?:} [14:11:58] hashar: we'll try to own these kinds of deployments in the future [14:12:02] hashar: yes [14:12:20] unleashing [14:12:37] !log rebooting labvirt1014 for kernel update [14:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:04] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: MF Beta: Enable moving first paragraph before infobox - T145216 (duration: 00m 47s) [14:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:11] T145216: MobileFormatter should relocate first paragraph ahead of infobox - https://phabricator.wikimedia.org/T145216 [14:13:17] hashar: thanks again [14:13:19] bmansurov: or we can keep deploying for you it is up to your team. 
Though if mobile had the possibility to deploy by itself that gives your team more agility / less dependency [14:13:25] you are welcome :} [14:13:33] yes, I agree [14:14:30] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:14:40] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:17:02] (03CR) 10Elukey: "Any feedback?" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/314662 (https://phabricator.wikimedia.org/T147436) (owner: 10Elukey) [14:18:39] 06Operations, 10ops-codfw: Broken disk in labstore2001 - https://phabricator.wikimedia.org/T149567#2763404 (10Papaul) I mentioned this yesterday "Re-opening this task since disk in slot 5 on labstore2001 is bad." and i open a task T149693. Rob is already working on ordering some disk. [14:19:58] (03PS1) 10Bmansurov: MF Beta: Don't move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319326 (https://phabricator.wikimedia.org/T145216) [14:23:06] (03PS1) 10Jdrewniak: Updating wikipedia.org portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319327 [14:25:38] 06Operations, 10ops-codfw: Broken disk in labstore2001 - https://phabricator.wikimedia.org/T149567#2763430 (10chasemp) [16:08:22] <_joe_> !log banning dbtree.wikimedia.org on cache_misc, T149357 [16:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:28] T149357: dbtree broken - https://phabricator.wikimedia.org/T149357 [16:08:54] 06Operations, 10DBA, 13Patch-For-Review: dbtree broken - https://phabricator.wikimedia.org/T149357#2763895 (10Joe) 05Open>03Resolved p:05Triage>03High [16:18:11] (03PS1) 10Rush: labstore: 2003/2004 add backup key [puppet] - 10https://gerrit.wikimedia.org/r/319352 [16:20:13] (03CR) 10Rush: [C: 032] labstore: 2003/2004 add backup key [puppet] - 10https://gerrit.wikimedia.org/r/319352 (owner: 
10Rush) [16:20:17] (03PS2) 10Rush: labstore: 2003/2004 add backup key [puppet] - 10https://gerrit.wikimedia.org/r/319352 [16:20:19] (03CR) 10Rush: [V: 032] labstore: 2003/2004 add backup key [puppet] - 10https://gerrit.wikimedia.org/r/319352 (owner: 10Rush) [16:22:54] (03CR) 10Subramanya Sastry: "Good catch! :)" [puppet] - 10https://gerrit.wikimedia.org/r/319268 (owner: 10Dzahn) [16:25:10] 06Operations, 10ops-eqiad, 10media-storage: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T149069#2763905 (10Cmjohnson) Disk has been swapped [16:26:48] 06Operations, 06Operations-Software-Development: wmf-reimage and handling of "-n" option - https://phabricator.wikimedia.org/T144264#2763907 (10Volans) [16:27:41] PROBLEM - puppet last run on labstore2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:28:02] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2763914 (10greg) [16:33:11] RECOVERY - MegaRAID on ms-be1005 is OK: OK: optimal, 13 logical, 13 physical [16:36:38] _joe_, wasn't around then, sorry [16:36:41] thanks for fixing it [16:37:08] <_joe_> np :) [16:41:51] (03PS2) 10Gehel: maps - create postgresql database for tiles storage [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) [16:45:16] 06Operations, 10ops-eqiad, 10media-storage: ms-be1005 - MegaRAID - CRITICAL: 1 failed LD(s) (Offline) - https://phabricator.wikimedia.org/T149069#2763951 (10Cmjohnson) 05Open>03Resolved Disk has been swapped, clrd foreign cfg, removed cache and added disk back cmjohnson@ms-be1005:~$ sudo megacli -CfgFo... 
[16:48:24] (03CR) 10Filippo Giunchedi: [C: 031] site: add varnish_exporter to esams/eqiad maps/misc [puppet] - 10https://gerrit.wikimedia.org/r/319291 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [16:48:57] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Set up docker building environment for production - https://phabricator.wikimedia.org/T149812#2763958 (10Joe) p:05Triage>03Normal [16:51:10] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: extend Varnish targets generation to text/upload [puppet] - 10https://gerrit.wikimedia.org/r/319293 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [16:51:31] 06Operations, 10ops-eqiad, 10media-storage: ms-be1001 - disk failure /dev/sdf1 - https://phabricator.wikimedia.org/T149073#2763967 (10Cmjohnson) 05Open>03Resolved Swapped disk, cleared foreign cfg, added VD back cmjohnson@ms-be1001:~$ sudo megacli -PDList -aALL |grep "Firmware state:" Firmware state: O... [16:51:47] !log copy labstore2001 tools backup to 2003 and others backup to 2004 for emergency maint [16:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:12] (03CR) 10Filippo Giunchedi: "Since this will effectively deploy varnish_exporter to all caches I'd move the role varnish_exporter directly inside the cache role and re" [puppet] - 10https://gerrit.wikimedia.org/r/319295 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [16:52:56] RECOVERY - MegaRAID on ms-be1001 is OK: OK: optimal, 14 logical, 14 physical [16:55:45] Hey marostegui - I have a DBA question. How hard/easy will it be to add a new column to the revision table? [16:57:40] (03PS3) 10Muehlenhoff: Add explicit dependency on ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/313963 [16:58:21] 06Operations, 10ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T149728#2763977 (10Cmjohnson) New disk requested through Dell Tech Direct Portal. 
[16:58:56] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:59:13] 06Operations, 10Traffic: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839#2763978 (10fgiunchedi) @bblack correct, for private wikis MW will go through img_auth.php and from there to ms-fe IIRC. I believe we can deny access to private wikis on upload altogether. [17:02:16] RECOVERY - puppet last run on ms-be1001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:09:03] 06Operations, 10DNS, 10Domains, 10Traffic, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2764038 (10CRoslof) >>! In T144508#2740901, @Naveenpf wrote: > @Aklapper Can you please change title to.... add new IP address ? We have changed... [17:09:30] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services, and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2764040 (10Joe) [17:10:26] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services, and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2758050 (10Joe) [17:13:33] 06Operations, 10Traffic: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839#2764056 (10BBlack) Ok, I see we have a list of the private wikis in puppet already in `$private_wikis` from `manifests/realm.pp`. Is it simple to transform those into upload paths to deny access... 
[17:17:46] RECOVERY - check_raid on payments2002 is OK: OK: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 OK] [17:20:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 643 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3067067 keys, up 2 days 8 hours - replication_delay is 643 [17:20:27] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services, and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2764068 (10jcrespo) @aaron Asks every week about this on multi-dc group, so pinging him. :-) [17:21:30] (03PS2) 10Ema: site: add varnish_exporter to esams/eqiad maps/misc [puppet] - 10https://gerrit.wikimedia.org/r/319291 (https://phabricator.wikimedia.org/T147424) [17:21:36] (03CR) 10Ema: [C: 032 V: 032] site: add varnish_exporter to esams/eqiad maps/misc [puppet] - 10https://gerrit.wikimedia.org/r/319291 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [17:22:40] (03PS1) 10Rush: labstore: 2003/2004 hiera additons to allow labstore backup key [puppet] - 10https://gerrit.wikimedia.org/r/319359 [17:24:15] (03CR) 10Rush: [C: 032] labstore: 2003/2004 hiera additons to allow labstore backup key [puppet] - 10https://gerrit.wikimedia.org/r/319359 (owner: 10Rush) [17:24:19] (03PS2) 10Rush: labstore: 2003/2004 hiera additons to allow labstore backup key [puppet] - 10https://gerrit.wikimedia.org/r/319359 [17:25:04] (03PS2) 10Ema: prometheus: extend Varnish targets generation to text/upload [puppet] - 10https://gerrit.wikimedia.org/r/319293 (https://phabricator.wikimedia.org/T147424) [17:25:09] (03CR) 10Rush: [V: 032] labstore: 2003/2004 hiera additons to allow labstore backup key [puppet] - 10https://gerrit.wikimedia.org/r/319359 (owner: 10Rush) [17:25:12] (03CR) 10Ema: [C: 032 V: 032] prometheus: extend Varnish targets generation to text/upload [puppet] - 10https://gerrit.wikimedia.org/r/319293 
(https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [17:25:25] (03PS3) 10Ema: prometheus: extend Varnish targets generation to text/upload [puppet] - 10https://gerrit.wikimedia.org/r/319293 (https://phabricator.wikimedia.org/T147424) [17:25:29] (03CR) 10Ema: [V: 032] prometheus: extend Varnish targets generation to text/upload [puppet] - 10https://gerrit.wikimedia.org/r/319293 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [17:28:36] RECOVERY - puppet last run on labstore2004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:29:17] 06Operations, 06Services (next), 15User-Joe, 15User-mobrovac, 05codfw-rollout: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2764150 (10GWicke) For all use cases I can think of in services re... [17:30:24] (03PS2) 10Jcrespo: tendril+dbtree: Explicitly disable automatic pulls from HEAD [puppet] - 10https://gerrit.wikimedia.org/r/319301 (https://phabricator.wikimedia.org/T149340) [17:31:01] 06Operations, 10ops-eqiad: Add thermal paste to einsteinium - https://phabricator.wikimedia.org/T149685#2764151 (10Cmjohnson) 05Open>03Resolved thermal paste re-applied. [17:31:49] (03CR) 10Jcrespo: [C: 032] "The reason for this is that, unlike mediawiki, HEAD changes cannot be tested before production deployment. 
This could be potentially dange" [puppet] - 10https://gerrit.wikimedia.org/r/319301 (https://phabricator.wikimedia.org/T149340) (owner: 10Jcrespo) [17:31:51] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2764153 (10Cmjohnson) a:05Cmjohnson>03RobH @robh: controllers have been swapped [17:32:06] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3056489 keys, up 2 days 9 hours - replication_delay is 0 [17:32:07] !log clean syslog/daemon.log on lithium, spam from mtail [17:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:37] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2764172 (10Pchelolo) [17:35:25] 06Operations, 10ops-eqiad, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2764182 (10RobH) a:05RobH>03Cmjohnson just chatted with chris, disks still need to be wiped. 
assigning back to him until thats compelted [17:41:40] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2764193 (10GWicke) [17:44:23] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2764211 (10GWicke) [17:46:48] (03PS1) 10Faidon Liambotis: mirrors: add rsync for Debian's push mirroring [puppet] - 10https://gerrit.wikimedia.org/r/319364 [17:48:06] (03CR) 10jenkins-bot: [V: 04-1] mirrors: add rsync for Debian's push mirroring [puppet] - 10https://gerrit.wikimedia.org/r/319364 (owner: 10Faidon Liambotis) [17:49:15] (03PS2) 10Faidon Liambotis: mirrors: add rsync for Debian's push mirroring [puppet] - 10https://gerrit.wikimedia.org/r/319364 [17:49:38] (03PS1) 10Rush: WIP: candidate idea for secondary backups [puppet] - 10https://gerrit.wikimedia.org/r/319365 [17:49:45] (03PS2) 10Rush: WIP: candidate idea for secondary backups [puppet] - 10https://gerrit.wikimedia.org/r/319365 [17:50:51] (03CR) 10Faidon Liambotis: [C: 032] mirrors: add rsync for Debian's push mirroring [puppet] - 10https://gerrit.wikimedia.org/r/319364 (owner: 10Faidon Liambotis) [17:50:59] 06Operations, 10Analytics-Cluster, 10Packaging: libcglib3-java replaces libcglib-java in Jessie - https://phabricator.wikimedia.org/T137791#2764216 (10MoritzMuehlenhoff) I think so [17:51:06] (03CR) 10jenkins-bot: [V: 04-1] WIP: candidate idea for secondary backups [puppet] - 10https://gerrit.wikimedia.org/r/319365 (owner: 10Rush) [17:51:30] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2764219 (10MoritzMuehlenhoff) p:05Triage>03Normal [17:51:30] jynus: I'm puppet-merging a change of yours [17:51:45] 06Operations, 10vm-requests: Site: 2 VM request for tendril - https://phabricator.wikimedia.org/T149557#2764220 (10MoritzMuehlenhoff) p:05Triage>03Normal [17:52:11] 06Operations, 
10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2764221 (10GWicke) [17:52:16] !log deploying new GUI on wdqs [17:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:34] (03CR) 10MaxSem: maps - create postgresql database for tiles storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [17:56:02] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2764229 (10jcrespo) [17:56:06] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2764226 (10jcrespo) 05Open>03Resolved a:03jcrespo I am assuming this as resolved because I think it is done and saw no complain it is not working fr... [17:56:47] SMalyshev: wdqs new GUI is deployed, tests looks good, feel free to have a look... [17:58:57] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2764233 (10greg) [17:59:25] (03CR) 10Gehel: maps - create postgresql database for tiles storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/318954 (https://phabricator.wikimedia.org/T147223) (owner: 10Gehel) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T1800). Please do the needful. [18:00:04] bmansurov, jan_drewniak, Urbanecm, and ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:11] Present [18:01:35] o/ [18:02:31] !log rebooting labvirt1012 [18:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:44] I can SWAT today [18:02:49] bmansurov: ping for SWAT [18:04:11] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319327 (owner: 10Jdrewniak) [18:04:58] (03Merged) 10jenkins-bot: Updating wikipedia.org portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319327 (owner: 10Jdrewniak) [18:05:45] (03PS1) 10Faidon Liambotis: mirrors: allow public rsync access to Debian [puppet] - 10https://gerrit.wikimedia.org/r/319367 [18:05:47] (03PS1) 10Faidon Liambotis: mirrors: config ftpsync for mirroring from debian.org [puppet] - 10https://gerrit.wikimedia.org/r/319368 [18:06:04] jan_drewniak: pulled on mw1099, unsure if you can check portals from there, but if you can please do. [18:07:02] (03CR) 10jenkins-bot: [V: 04-1] mirrors: allow public rsync access to Debian [puppet] - 10https://gerrit.wikimedia.org/r/319367 (owner: 10Faidon Liambotis) [18:07:22] thcipriani: yup, it's fine on mw1099 [18:07:50] (03CR) 10jenkins-bot: [V: 04-1] mirrors: config ftpsync for mirroring from debian.org [puppet] - 10https://gerrit.wikimedia.org/r/319368 (owner: 10Faidon Liambotis) [18:07:54] jan_drewniak: cool. 
Going live everywhere [18:08:14] (03PS2) 10Faidon Liambotis: mirrors: config ftpsync for mirroring from debian.org [puppet] - 10https://gerrit.wikimedia.org/r/319368 [18:08:16] (03PS2) 10Faidon Liambotis: mirrors: allow public rsync access to Debian [puppet] - 10https://gerrit.wikimedia.org/r/319367 [18:09:10] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:319327|Updating wikipedia.org portal (T128546) (T135441)]] (duration: 00m 47s) [18:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:17] T128546: [Recurring Task] Update Wikipedia.org Portal and sister Wiki's statistics - https://phabricator.wikimedia.org/T128546 [18:09:18] T135441: [EPIC] Wikipedia.org Portal: add app download badge - https://phabricator.wikimedia.org/T135441 [18:09:57] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:319327|Updating wikipedia.org portal (T128546) (T135441)]] (duration: 00m 47s) [18:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:05] ^ jan_drewniak live everywhere [18:10:22] (03CR) 10jenkins-bot: [V: 04-1] mirrors: config ftpsync for mirroring from debian.org [puppet] - 10https://gerrit.wikimedia.org/r/319368 (owner: 10Faidon Liambotis) [18:10:42] thcipriani: great, thanks! 
[18:10:59] Urbanecm: you're up :) [18:11:05] Okay :) [18:11:11] Ready for testing [18:11:21] !log Sending final Tool Labs survey reminder emails from silver (T147336) [18:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:27] T147336: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336 [18:11:28] (03PS2) 10Thcipriani: Add maiwiki HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319330 (https://phabricator.wikimedia.org/T149790) (owner: 10Urbanecm) [18:11:40] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319330 (https://phabricator.wikimedia.org/T149790) (owner: 10Urbanecm) [18:11:55] (03PS3) 10Faidon Liambotis: mirrors: config ftpsync for mirroring from debian.org [puppet] - 10https://gerrit.wikimedia.org/r/319368 [18:12:01] (03CR) 10Faidon Liambotis: [C: 032] mirrors: allow public rsync access to Debian [puppet] - 10https://gerrit.wikimedia.org/r/319367 (owner: 10Faidon Liambotis) [18:12:17] (03Merged) 10jenkins-bot: Add maiwiki HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319330 (https://phabricator.wikimedia.org/T149790) (owner: 10Urbanecm) [18:12:54] Urbanecm: live on mw1099, check please [18:13:00] Going to do it... [18:13:14] (03CR) 10Faidon Liambotis: [C: 032] mirrors: config ftpsync for mirroring from debian.org [puppet] - 10https://gerrit.wikimedia.org/r/319368 (owner: 10Faidon Liambotis) [18:13:39] thcipriani: All is okay, please deploy it everywhere. 
[18:13:46] Urbanecm: ok, going live [18:15:43] !log thcipriani@tin Synchronized static/images/project-logos: SWAT: [[gerrit:319330|Add maiwiki HD logos (T149790)]] PART I (duration: 00m 47s) [18:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:50] T149790: Update Maithili Wikipedia Logo - https://phabricator.wikimedia.org/T149790 [18:15:50] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2764317 (10Papaul) a:05Papaul>03jcrespo Dick replacement complete. [18:17:23] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:319330|Add maiwiki HD logos (T149790)]] PART II (duration: 00m 47s) [18:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:38] thcipriani: Thanks for the deploy! [18:19:01] thcipriani: pong ;) [18:19:12] 06Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670#2764332 (10Papaul) a:05Papaul>03Marostegui Disk placement complete. [18:19:21] Urbanecm: yw :) [18:19:28] ebernhardson: hello :) [18:19:52] mines real minor, removes a couple options from an array in javascript [18:20:01] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2764363 (10Papaul) Received the disk but it was the wrong one. I had to send it back for replacement. [18:20:04] PROBLEM - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old. [18:20:42] yep. Now we wait for jenkins :) [18:22:30] !log rebooting labvirt1013 [18:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:21] ebernhardson: I pulled both changes down to mw1099, if there's anything you want to check there. [18:25:44] er, same change for both branches [18:27:15] thcipriani: looks sane [18:27:28] ok. 
going live everywhere, wmf.1 first [18:29:45] !log thcipriani@tin Synchronized php-1.29.0-wmf.1/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:319363|Turn off Cirrus AB test on zh and ja (T147499)]] (duration: 00m 47s) [18:29:49] (03PS1) 10Faidon Liambotis: mirrrors: set up push mirroring for Debian [puppet] - 10https://gerrit.wikimedia.org/r/319371 [18:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:52] T147499: Turn off second BM25 test - https://phabricator.wikimedia.org/T147499 [18:31:16] !log thcipriani@tin Synchronized php-1.28.0-wmf.23/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:319362|Turn off Cirrus AB test on zh and ja (T147499)]] (duration: 00m 46s) [18:31:18] ^ ebernhardson live everywhere [18:31:20] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2764404 (10jcrespo) Actually there was one issue with entry point (using the socket instead of the hostname), and another one with a check which required... [18:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:28] (03CR) 10Faidon Liambotis: [C: 032] mirrrors: set up push mirroring for Debian [puppet] - 10https://gerrit.wikimedia.org/r/319371 (owner: 10Faidon Liambotis) [18:31:33] 06Operations, 10RESTBase, 10Traffic, 06Services (doing): Restbase redirects with cors not working on Android 4 native browser - https://phabricator.wikimedia.org/T149295#2764406 (10GWicke) @pchelolo and I discussed this a bit in person, and decided to go with a RESTBase-only solution for now. Concretely, w... [18:32:54] thcipriani: thanks! 
[18:33:33] bmansurov: ping me if you're around for SWAT before 19:00UTC [18:34:29] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2764412 (10jcrespo) [18:34:31] 06Operations, 10ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149703#2764414 (10jcrespo) [18:35:44] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:35:56] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T149377#2750453 (10jcrespo) Waiting for a complete rebuild to close this https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db2052&service=HP+RAID [18:37:53] (03PS1) 10Rush: labsdb: maintain-views use control socket not host [puppet] - 10https://gerrit.wikimedia.org/r/319375 [18:38:00] 06Operations, 10ops-codfw: Predictive disk failure on db2047 - https://phabricator.wikimedia.org/T149670#2764423 (10jcrespo) Rebuilding, waiting for it to complete to close: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db2047&service=HP+RAID [18:38:52] (03PS2) 10Rush: labsdb: maintain-views use control socket not host [puppet] - 10https://gerrit.wikimedia.org/r/319375 [18:39:53] 06Operations, 10ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2764433 (10jcrespo) Open so schedule it when you have the time. Should we update the BIOS (again?). CC'ing @robh here. 
[18:40:20] (03PS1) 10Faidon Liambotis: mirrors: workaround a ferm @resolve bug with v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/319376 [18:42:51] (03PS3) 10Rush: labsdb: maintain-views use control socket not host [puppet] - 10https://gerrit.wikimedia.org/r/319375 [18:43:11] (03CR) 10Faidon Liambotis: [C: 032] mirrors: workaround a ferm @resolve bug with v4/v6 [puppet] - 10https://gerrit.wikimedia.org/r/319376 (owner: 10Faidon Liambotis) [18:44:04] RECOVERY - Ubuntu mirror in sync with upstream on sodium is OK: /srv/mirrors/ubuntu is over 0 hours old. [18:50:07] thcipriani: i'm here [18:50:24] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#2758050 (10GWicke) The discussion in {T125069} is highly relevant here. I still think that DNS... [18:50:39] bmansurov: ok, ready to get your change out? [18:50:45] yes [18:51:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319326 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [18:51:21] (03PS4) 10Rush: labsdb: maintain-views use control socket not host [puppet] - 10https://gerrit.wikimedia.org/r/319375 [18:51:23] (03PS2) 10Thcipriani: MF Beta: Don't move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319326 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [18:51:30] (03CR) 10Thcipriani: MF Beta: Don't move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319326 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [18:51:36] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319326 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [18:51:41] oh gerrit. 
[18:51:50] (03CR) 10Rush: [C: 032 V: 032] labsdb: maintain-views use control socket not host [puppet] - 10https://gerrit.wikimedia.org/r/319375 (owner: 10Rush) [18:52:16] (03Merged) 10jenkins-bot: MF Beta: Don't move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319326 (https://phabricator.wikimedia.org/T145216) (owner: 10Bmansurov) [18:52:52] 06Operations, 10hardware-requests: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2764445 (10jcrespo) [18:53:48] bmansurov: changes are live on mw1099, although might not be much to test there, let me know if it's good to go live. [18:53:53] 06Operations, 10ops-eqiad, 10hardware-requests: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2764461 (10jcrespo) [18:54:02] thcipriani: I see the change there. good to go [18:54:09] ok, going live everywhere [18:55:58] thcipriani: now ? [18:56:29] thcipriani: thanks [18:56:30] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:319326|MF Beta: Do not move first paragraph before infobox (T145216)]] (duration: 00m 49s) [18:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:37] T145216: MobileFormatter should relocate first paragraph ahead of infobox - https://phabricator.wikimedia.org/T145216 [18:56:37] ^ bmansurov change is live [18:57:02] matanya: what do you mean? I'm syncing out the last of the swat changes. [18:57:15] thcipriani: looks good. 
[18:58:50] thcipriani: i was thinking you are going for 1.29-wmf.1 for group1 [18:58:56] oh :) [18:59:17] nope twentyafterfour will be handling that here in a few :) [18:59:48] (03PS1) 10Rush: labsdb: update to match private for maintainviews [puppet] - 10https://gerrit.wikimedia.org/r/319381 [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T1900). Please do the needful. [19:00:59] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 2 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2764508 (10jcrespo) Labs and DBAs agree this should go on. @robh @Cmjohnson Is this something we... [19:02:09] (03PS2) 10Rush: labsdb: update to match private for maintainviews [puppet] - 10https://gerrit.wikimedia.org/r/319381 [19:02:17] (03CR) 10Rush: [C: 032 V: 032] labsdb: update to match private for maintainviews [puppet] - 10https://gerrit.wikimedia.org/r/319381 (owner: 10Rush) [19:03:37] (03PS1) 10Dzahn: mgmt: script to extract mgmt IPs from DNS [puppet] - 10https://gerrit.wikimedia.org/r/319383 (https://phabricator.wikimedia.org/T147074) [19:03:51] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:05:05] * twentyafterfour prepares to do the needful [19:05:38] (03PS2) 10Dzahn: mgmt: script to extract mgmt IPs from DNS [puppet] - 10https://gerrit.wikimedia.org/r/319383 (https://phabricator.wikimedia.org/T147074) [19:06:44] 06Operations, 10Ops-Access-Requests: access to wm-log-reader for viewing logs - https://phabricator.wikimedia.org/T149832#2764536 (10Matanya) [19:07:05] (03CR) 10Dzahn: [C: 032] mgmt: script to extract mgmt IPs from DNS [puppet] - 10https://gerrit.wikimedia.org/r/319383 (https://phabricator.wikimedia.org/T147074) (owner: 10Dzahn) [19:08:08] !log rebooting asw2-d-eqiad [19:08:13] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:18] (03PS1) 10Rush: labstore: apply labs::db::views [puppet] - 10https://gerrit.wikimedia.org/r/319384 [19:09:49] twentyafterfour: please hold a few while i trying to verify a fatal is not user facing, and hence a blocker for the release, thanks [19:10:05] matanya: ok [19:10:37] Blockers for wmf.1: T149059 [19:10:38] T149059: MW-1.29.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T149059 [19:11:31] PROBLEM - Host asw2-d-eqiad.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [19:11:42] (03CR) 10Rush: [C: 032] labstore: apply labs::db::views [puppet] - 10https://gerrit.wikimedia.org/r/319384 (owner: 10Rush) [19:12:51] RECOVERY - Host asw2-d-eqiad.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms [19:20:34] 06Operations, 06Analytics-Kanban, 10EventBus: Open up kafka1003 port 9092 in Analytics vlan ACL - https://phabricator.wikimedia.org/T149835#2764651 (10Ottomata) [19:20:36] (03PS1) 10Dzahn: mgmt: follow-up fix to getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/319386 [19:21:24] (03PS2) 10Dzahn: mgmt: follow-up fix to getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/319386 [19:21:28] (03CR) 10Dzahn: [C: 032] mgmt: follow-up fix to getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/319386 (owner: 10Dzahn) [19:23:27] (03CR) 10Papaul: [V: 032] mgmt: follow-up fix to getmgmtips script [puppet] - 10https://gerrit.wikimedia.org/r/319386 (owner: 10Dzahn) [19:25:34] 06Operations, 06Analytics-Kanban, 10EventBus: Open up kafka1003 port 9092 in Analytics vlan ACL - https://phabricator.wikimedia.org/T149835#2764698 (10faidon) 05Open>03Resolved Done! 
[19:25:37] 06Operations, 06Analytics-Kanban, 10EventBus, 13Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2764700 (10faidon) [19:31:53] twentyafterfour: green light [19:37:01] matanya: thanks [19:41:39] (03PS1) 10Faidon Liambotis: mirrors: serve ubuntu over rsync as well [puppet] - 10https://gerrit.wikimedia.org/r/319387 [19:42:19] (03CR) 10Faidon Liambotis: [C: 032] mirrors: serve ubuntu over rsync as well [puppet] - 10https://gerrit.wikimedia.org/r/319387 (owner: 10Faidon Liambotis) [19:43:12] (03PS1) 10Dzahn: fixing typos in mgmt DNS [dns] - 10https://gerrit.wikimedia.org/r/319388 [19:45:09] sigh jenkins [19:45:39] (03PS2) 10Dzahn: fixing typos in mgmt DNS [dns] - 10https://gerrit.wikimedia.org/r/319388 [19:46:40] mutante: the last fix has a typo [19:46:42] (03CR) 10Papaul: [V: 031] fixing typos in mgmt DNS [dns] - 10https://gerrit.wikimedia.org/r/319388 (owner: 10Dzahn) [19:46:48] (03CR) 10Faidon Liambotis: [V: 032] mirrors: serve ubuntu over rsync as well [puppet] - 10https://gerrit.wikimedia.org/r/319387 (owner: 10Faidon Liambotis) [19:47:20] (03CR) 10Faidon Liambotis: [C: 04-1] fixing typos in mgmt DNS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/319388 (owner: 10Dzahn) [19:47:32] !log maintain-views --databases olowiki on labsdb1001 and 1003 to create view [19:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:25] (03PS3) 10Dzahn: fixing typos in mgmt DNS [dns] - 10https://gerrit.wikimedia.org/r/319388 [19:49:29] yes, thanks paravoid [19:51:41] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.1 [19:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:31] New error message in wmf.1: Unable to deliver event: 400: Must provide body in request. 
[19:56:09] (03CR) 10Dzahn: [C: 032] fixing typos in mgmt DNS [dns] - 10https://gerrit.wikimedia.org/r/319388 (owner: 10Dzahn) [19:56:27] twentyafterfour: also Catchable fatal error: Argument 3 passed to CoreParserFunctions::formatRaw() must be an instance of Language, StubUserLang given in /srv/mediawiki/php-1.29.0-wmf.1/includes/parser/CoreParserFunctions.php on line 496 [19:56:49] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2764842 (10chasemp) 05Open>03Resolved olowiki_p exists [19:59:43] hmm [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T2000). [20:00:49] twentyafterfour: this commit : https://phabricator.wikimedia.org/rMW4290f686c07265d40718fc3358f196de41bbde57 [20:02:48] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05MW-1.28-release-notes, 13Patch-For-Review: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2764886 (10chasemp) [20:02:52] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Enable access to Wikipedia Tulu (tcywiki) on labs replicas - https://phabricator.wikimedia.org/T142223#2764883 (10chasemp) 05Open>03Resolved a:03chasemp _p view variant should be good to go [20:02:56] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2764887 (10chasemp) [20:03:16] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#530760 (10chasemp) [20:03:20] 06Operations, 10DBA, 06Labs, 10Tool-Labs: Replicate wikimania2017wiki to labs - https://phabricator.wikimedia.org/T126096#2764890 (10chasemp) 05Open>03Resolved a:03chasemp _p view variant should be 
good to go [20:03:36] twentyafterfour: https://phabricator.wikimedia.org/T149840 [20:03:47] !log maintain-views --databases wikimania2017wiki --debug on labsdb1001 and 1003 [20:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:01] !log maintain-views --databases tcywiki --debug on labsdb1001 and 1003 [20:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:05] matanya: thank you [20:07:11] Should this be fixed soon (like now?) or is it a blocker warranting a train going backwards? [20:07:12] PROBLEM - puppet last run on lvs1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:08:25] If the latter, then adding T149059 as parent [20:08:26] T149059: MW-1.29.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T149059 [20:08:37] twentyafterfour [20:08:51] arseny92: i added it, and poked Nikerabbit [20:09:05] hope he sees it soon and helps us judge [20:09:33] (03PS6) 10Ottomata: Use ordered_yaml function to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) [20:09:58] !log starting Parsoid deploy [20:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:12] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 19 failures [20:10:25] twentyafterfour: Nikerabbit said in -i18n to revert it [20:12:19] Amire80 was the one who merged that patch and legoktm cherrypicked to release [20:12:37] what [20:12:54] legoktm https://gerrit.wikimedia.org/r/#/c/110342 [20:13:13] (03PS7) 10Ottomata: Use ordered_yaml function to render node::service deployment vars [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) [20:13:48] uh [20:13:52] stack trace [20:14:36] twentyafterfour: okay, I'd recommend reverting https://gerrit.wikimedia.org/r/#/c/110342/ for now (out of master and 
REL1_28) [20:15:12] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 19 failures [20:17:11] (03PS1) 10Dzahn: fix typo labdsb1008.mgmt -> labsdb1008.mgmt [dns] - 10https://gerrit.wikimedia.org/r/319396 [20:18:27] legoktm , twentyafterfour , i already have the patch pages open for reverting. Train conductor verdict please ;) [20:18:46] link? [20:19:20] (03PS2) 10Dzahn: fix typo labdsb1008.mgmt -> labsdb1008.mgmt [dns] - 10https://gerrit.wikimedia.org/r/319396 (https://phabricator.wikimedia.org/T149829) [20:19:45] (03CR) 10Dzahn: [C: 032] fix typo labdsb1008.mgmt -> labsdb1008.mgmt [dns] - 10https://gerrit.wikimedia.org/r/319396 (https://phabricator.wikimedia.org/T149829) (owner: 10Dzahn) [20:20:12] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:21:48] twentyafterfour , shall we revert? y/n ;) [20:24:03] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/319400/ [20:25:28] is there any partial outage or anything not software related that would explain the performance regression I'm seeing on https://performance.wikimedia.org/#!/week? [20:25:49] legoktm: hrm, should this go into 1.28.0-rc.0 as well? [20:26:06] thcipriani: I +2'd a backport for REL1_28 [20:27:08] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2766021 (10Zppix) @hashar As you may already know we are actually doing not that we are attempting (still WIP) to get ssh to gerr... 
[20:27:24] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2766033 (10chasemp) Ok with me :) [20:27:38] im shocked wikibugs got kicked [20:27:45] !log updated Parsoid to version 173d7e32 (T149241, T119228, T141723, T141905, T147742, T48580, T133320) [20:27:54] legoktm: ah, past-tense :) [20:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:59] T48580: Create a VisualEditor plugin to integrate with ProofreadPage - https://phabricator.wikimedia.org/T48580 [20:27:59] T141723: [Bug] Wikipedia article on the letter "ß" does not load properly. - https://phabricator.wikimedia.org/T141723 [20:27:59] T141905: Parsoid crashes from logs - https://phabricator.wikimedia.org/T141905 [20:27:59] T119228: Switch Parsoid to use node > 0.10 (most likely node 4.x) - https://phabricator.wikimedia.org/T119228 [20:27:59] T133320: Unified extension registration mechanism for core/VE/Parsoid - https://phabricator.wikimedia.org/T133320 [20:28:00] T149241: Unknown contentmodel wikibase-item - https://phabricator.wikimedia.org/T149241 [20:28:00] T147742: Template generated data-mw include incomplete href - https://phabricator.wikimedia.org/T147742 [20:28:09] sorry, uploaded a patch-set 2 [20:28:18] thcipriani: you'll need to re+2 then [20:28:20] * legoktm off to class [20:29:09] ori: i am not aware of any [20:31:33] anyone notice that grrrit-wm is being slow? 
[20:32:38] legoktm https://gerrit.wikimedia.org/r/#/c/319403/ revert on master as you said above;) [20:32:52] 06Operations, 10Traffic: reimage cp4016 and cp1055 - https://phabricator.wikimedia.org/T149843#2766116 (10BBlack) [20:34:23] !log depool cp4016 - T149843 [20:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:30] T149843: reimage cp4016 and cp1055 - https://phabricator.wikimedia.org/T149843 [20:35:34] !log depool cp1055 - T149843 [20:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:01] marktraceur is the most recent edit on https://gerrit.wikimedia.org/r/#/c/318976/ better? [20:36:12] RECOVERY - puppet last run on lvs1012 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:37:03] Zppix: It's better in that it's marked in that one spot, but most of the rest of the code is also totally without value, so really, you should mark the entire patch as WIP instead of relying on one TODO comment. [20:37:35] Zppix: Also, all the rest of my comments are now lost in a past patchset, which is super [20:37:48] i can readd them if you wish? [20:37:56] Zppix: Wow, that would just be aces [20:38:18] marktraceur i cannot tell if your being saracastic or what [20:38:34] Zppix: I'm pretty frustrated with the whole CR process on this patch, so yeah, sarcasm [20:38:45] Zppix: Read the CR, fix *all* the issues, and then submit a new patchset. 
[20:38:47] (03CR) 10Ottomata: "hookay, this seems to work better with the ordered_yaml function:" [puppet] - 10https://gerrit.wikimedia.org/r/319129 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [20:39:21] marktraceur, I'm the one working on the whitelist paladox is the one working on the ssh to gerrit stuff (mostly because hes been trying to find solutions) [20:40:12] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:18] 06Operations, 10Traffic: reimage cp4016 and cp1055 - https://phabricator.wikimedia.org/T149843#2766116 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4016.ulsfo.wmnet', 'cp1055.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimag... [20:40:26] Zppix: I understand that. Maybe you should be working on separate patches to avoid stepping on one another's toes, like hiding comments in past patchsets. [20:41:02] marktraceur that would cause merge conflicts more then likely since what we are working on is talking to each other (and in the same files) [20:41:09] than* [20:42:00] Zppix: I don't think it would be too hard. You can write the patch that adds canUserRestart(), and all the functionality, and he can add the patch that checks canUserRestart(from) and restarts the server. [20:42:23] marktraceur you realise i have more features planned then just the restart right? [20:42:27] (03CR) 10Jcrespo: "> not sure why this change depends on non-automatic pulls of the latest by git::clone (or being deployed by scap). Granted we could deploy" [puppet] - 10https://gerrit.wikimedia.org/r/319300 (https://phabricator.wikimedia.org/T149340) (owner: 10Alexandros Kosiaris) [20:43:01] Zppix: No, but I don't see how additional features that use canUserRestart() would change my suggestion. 
[20:43:21] marktraceur i'm not disagreeing with you i see why you can/probably are pissed atm but i also know merge conflicts are a bitch [20:43:43] Zppix: Only if you're entangling separate features unnecessarily [20:44:20] marktraceur i'm willing to write in another patch but i also know that having it in one place also has its benefits [20:44:31] Zppix: Sure, you do you [20:44:43] Just make sure it's WIP so I don't waste another half hour reviewing it [20:45:12] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 211 seconds ago with 0 failures [20:45:21] 06Operations, 10Wikimedia-General-or-Unknown: Icinga has httpauth on (not accessible for public) - https://phabricator.wikimedia.org/T62112#2766185 (10Multichill) >>! In T62112#2763374, @MoritzMuehlenhoff wrote: > Although we do have a security-support version of Icinga running now, I'm against opening up the... [20:45:29] marktraceur, don't get me wrong you have your ways of doing stuff and i have my ways, and thats perfectly acceptable, and honestly, i've personally mentioned on phab tasks and on gerrit (i believe) that this patch is indeed WIP [20:46:05] Zppix: It's not at the front of the commit message, which is what literally every WIP patch on Gerrit uses. 
[20:46:16] !log starting mobileapps deployment [20:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:32] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4016_v4, cp4016_v6 [20:46:33] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4016_v4, cp4016_v6 [20:46:42] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4016_v4, cp4016_v6 [20:46:42] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4016_v4, cp4016_v6 [20:46:42] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4016_v4, cp4016_v6 [20:46:42] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4016_v4, cp4016_v6 [20:46:52] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4016_v4, cp4016_v6 [20:46:53] marktraceur the reason being is it wasnt as wip at first but now that we got better understanding of what was wanted its a major overhaul [20:47:12] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4016_v4, cp4016_v6 [20:47:21] OK, I don't think that's true, but whatever [20:47:22] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4016_v4, cp4016_v6 [20:47:22] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4016_v4, cp4016_v6 [20:47:32] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4016_v4, cp4016_v6 [20:47:32] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 not-conn: cp4016_v4, cp4016_v6 [20:48:59] there updated marktraceur [20:49:22] !log deployed mobileapps 0ced96c [20:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:55] Zppix: Thanks, that helps. I'll ignore it until y'all ask for help or remove the tag. [20:50:27] marktraceur ack! 
Ill make sure i ping you if i or paladox remove [WIP] [20:50:47] arseny92: legoktm: I'm pro-revert ;) [20:55:54] 06Operations, 10Parsoid, 13Patch-For-Review, 15User-mobrovac: Deploy failed on wtp2017.codfw.wmnet - https://phabricator.wikimedia.org/T149115#2766213 (10Arlolra) Well, anecdotally, that seemed to help. Unfortunately, it still didn't get through cleanly, and in my haste I failed to note the issue. It see... [20:59:52] just waiting for https://gerrit.wikimedia.org/r/#/c/319400/ to merge [21:00:53] 06Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#2766226 (10BBlack) [21:09:40] !log twentyafterfour@tin Synchronized php-1.29.0-wmf.1/includes/parser/CoreParserFunctions.php: Deploy https://gerrit.wikimedia.org/r/#/c/319400/ refs T149840, T149059 (duration: 00m 51s) [21:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:49] T149059: MW-1.29.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T149059 [21:09:49] T149840: [Bug] parser fatal : Catchable fatal error: Argument 3 passed to CoreParserFunctions::formatRaw() must be an instance of Language - https://phabricator.wikimedia.org/T149840 [21:09:50] !log pooling cp1055 - T149843 [21:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:57] T149843: reimage cp4016 and cp1055 - https://phabricator.wikimedia.org/T149843 [21:10:06] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [21:11:16] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1054 is CRITICAL: connect to address 10.64.32.106 and port 3128: Connection refused [21:11:57] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2766307 (10GWicke) [21:12:13] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - 
https://phabricator.wikimedia.org/T66214#1874700 (10GWicke) [21:12:25] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#2766313 (10GWicke) [21:14:24] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2766336 (10GWicke) I have refactored this RFC to focus on the public thumb API, and moved content hash based thumb naming to a sub-task. We would like... [21:15:06] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 179 seconds ago with 0 failures [21:15:16] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.072 second response time [21:16:16] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:30:04] MaxSem and yurik: Dear anthropoid, the time has come. Please deploy Kartographer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T2130). [21:30:29] (03PS1) 10Yurik: Enable maps snapshots on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319415 [21:30:30] MaxSem, ^ [21:30:39] never seen kartographer before. [21:30:48] Zppix, it has seen you! [21:30:58] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [21:30:58] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [21:31:08] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [21:31:11] from far far above... 
[21:31:18] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [21:31:19] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [21:31:28] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [21:31:28] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [21:31:38] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [21:31:38] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [21:31:38] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [21:31:48] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [21:31:48] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [21:31:54] yurik, merging https://gerrit.wikimedia.org/r/#/c/319411/1 [21:32:15] yurik its even seen me in the shower err poor thing. xD alright im done i had my laugh [21:32:28] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#2766398 (10GWicke) a:05brion>03None [21:32:47] Zppix, you laugh in the shower??? [21:34:56] no [21:35:02] i was meaning im done being off topic [21:35:56] hehe [21:35:59] !log pooling cp4016 - T149843 [21:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:06] T149843: reimage cp4016 and cp1055 - https://phabricator.wikimedia.org/T149843 [21:37:40] MaxSem: yurik what is this special window for? [21:38:37] greg-g, we want to be a bit more careful with enabling new map configuration. Usually we would do it via SWATs, but its better to have a full hour... 
i guess we are being overcautious [21:38:40] greg-g, we want to deploy a fix ( https://gerrit.wikimedia.org/r/#/c/319411/1 ), then attempt https://gerrit.wikimedia.org/r/319415 [21:39:22] yurik: more careful is good, just curious why this didn't go in the normal services window is all [21:39:37] no services involved [21:39:45] oh, right [21:39:55] names [21:39:59] carry on [21:40:08] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [21:41:40] !log maxsem@tin Synchronized php-1.29.0-wmf.1/extensions/Kartographer/: https://gerrit.wikimedia.org/r/#/c/319411/1 (duration: 00m 49s) [21:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:16] (03PS2) 10Dzahn: add projectcom.wikimedia.org for new private wiki [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) [21:42:57] yurik, jgirault ^^^ [21:43:38] MaxSem, you skipped mw1099?! oh my god [21:44:18] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [21:45:08] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 175 seconds ago with 0 failures [21:45:10] (03CR) 10Dzahn: "should this wait until T143138#2763522 is unblocked by jcrespo?" [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) (owner: 10Dzahn) [21:45:29] MaxSem, seems that maps show up fine in https://test.wikipedia.org/wiki/Mapframe - lets merge the next one [21:46:16] (03PS2) 10Dzahn: phabricator: Fix empty "parentProject" when new project is a milestone [puppet] - 10https://gerrit.wikimedia.org/r/318699 (owner: 10Aklapper) [21:46:30] (03CR) 10Dzahn: [C: 032] phabricator: Fix empty "parentProject" when new project is a milestone [puppet] - 10https://gerrit.wikimedia.org/r/318699 (owner: 10Aklapper) [21:46:48] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:46:57] !log Restarting Jenkins due to deadlock with the beta cluster jobs [21:46:59] (03CR) 10Dereckson: "The database can't be created before jcrespo unblocks, but DNS and Apache can process in parallel. The goal is mainly to avoid future cont" [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) (owner: 10Dzahn) [21:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:56] (03CR) 10MaxSem: [C: 032] Enable maps snapshots on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319415 (owner: 10Yurik) [21:48:37] yurik: MaxSem: CI will be back shortly [21:48:51] sigh [21:49:15] it is processing your change :D [21:49:32] (03Merged) 10jenkins-bot: Enable maps snapshots on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319415 (owner: 10Yurik) [21:51:03] (03PS1) 10Yurik: Enable maps snapshots on cawiki, hewiki, mkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319447 [21:51:11] MaxSem, ^ [21:51:35] (03PS1) 10Yuvipanda: statistics: Separate research mysql cluster credentials into define [puppet] - 10https://gerrit.wikimedia.org/r/319451 [21:52:05] yurik, pulled on mw1099 [21:52:11] * yurik looks [21:53:01] MaxSem, https://www.mediawiki.org/wiki/Help:Extension:Kartographer looks great - after a simple refresh [21:53:20] MaxSem, actually wait [21:53:21] sec [21:55:34] (03PS1) 10BBlack: network::constants: remove /64 from icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/319470 [21:56:12] MaxSem, it seems i had to do an null-save refresh to get the snapshot [21:56:30] (03PS2) 10BBlack: network::constants: remove /64 from icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/319470 [22:00:24] (03PS2) 10Yuvipanda: statistics: Separate research mysql cluster credentials into define [puppet] - 10https://gerrit.wikimedia.org/r/319451 [22:00:44] (03CR) 10Yuvipanda: [C: 032 V: 032] statistics: Separate research 
mysql cluster credentials into define [puppet] - 10https://gerrit.wikimedia.org/r/319451 (owner: 10Yuvipanda) [22:07:27] 06Operations, 10Ops-Access-Requests: access to wm-log-reader for viewing logs - https://phabricator.wikimedia.org/T149832#2766583 (10greg) Approved from my end. [22:07:28] (03PS3) 10BBlack: network::constants: remove /64 from icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/319470 [22:08:11] (03CR) 10BBlack: [C: 032 V: 032] network::constants: remove /64 from icinga hosts [puppet] - 10https://gerrit.wikimedia.org/r/319470 (owner: 10BBlack) [22:10:04] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [22:10:27] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/319415/ (duration: 00m 47s) [22:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:59] yurik, ^ [22:12:34] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [22:13:34] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 2.1e-05 secs [22:14:24] (03PS1) 10Yuvipanda: paws_internal: Add mysql reseach creds to notebook1001 [puppet] - 10https://gerrit.wikimedia.org/r/319475 (https://phabricator.wikimedia.org/T149543) [22:14:37] (03PS2) 10Yuvipanda: paws_internal: Add mysql reseach creds to notebook1001 [puppet] - 10https://gerrit.wikimedia.org/r/319475 (https://phabricator.wikimedia.org/T149543) [22:14:44] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [22:15:04] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 219 seconds ago with 0 failures [22:16:44] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [22:17:44] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 0.001763 secs [22:18:14] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:21:20] (03PS2) 10BBlack: site: add varnish_exporter to all varnishes [puppet] - 10https://gerrit.wikimedia.org/r/319295 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [22:21:48] (03PS3) 10BBlack: prometheus: add varnish_exporter to all varnishes [puppet] - 10https://gerrit.wikimedia.org/r/319295 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [22:22:17] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [22:23:42] (03CR) 10EBernhardson: Add a wiki configuration tag for configured language (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319253 (https://phabricator.wikimedia.org/T149755) (owner: 10EBernhardson) [22:24:39] (03CR) 10MaxSem: [C: 032] Enable maps snapshots on cawiki, hewiki, mkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319447 (owner: 10Yurik) [22:24:57] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [22:25:14] (03Merged) 10jenkins-bot: Enable maps snapshots on cawiki, hewiki, mkwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/319447 (owner: 10Yurik) [22:25:57] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset -0.000412 secs [22:26:44] yurik, pulled on mw1099 [22:26:48] * yurik looks [22:27:11] (03PS3) 10Yuvipanda: paws_internal: Add mysql reseach creds to notebook1001 [puppet] - 10https://gerrit.wikimedia.org/r/319475 (https://phabricator.wikimedia.org/T149543) [22:27:13] (03PS1) 10Yuvipanda: roles: Kill the 'notebook' roles [puppet] - 10https://gerrit.wikimedia.org/r/319476 [22:27:27] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [22:27:32] (03CR) 10Yuvipanda: [C: 032 V: 032] paws_internal: Add mysql reseach creds to notebook1001 [puppet] - 10https://gerrit.wikimedia.org/r/319475 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [22:28:47] 
PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [22:28:51] (03PS1) 10Filippo Giunchedi: Initial commit [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 [22:29:27] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset -0.000888 secs [22:30:47] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset -0.000539 secs [22:31:29] (03CR) 10Filippo Giunchedi: "Metrics from mw1017:" [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (owner: 10Filippo Giunchedi) [22:32:07] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 39 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mysql/conf.d/research-client.cnf] [22:32:47] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [22:33:01] (03PS2) 10Filippo Giunchedi: Initial commit [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) [22:34:47] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset -0.001158 secs [22:35:20] (03PS1) 10Yuvipanda: paws_internal: Provision research users on notebook node [puppet] - 10https://gerrit.wikimedia.org/r/319479 (https://phabricator.wikimedia.org/T149543) [22:37:07] RECOVERY - NTP on mw2098 is OK: NTP OK: Offset -0.002188533545 secs [22:38:51] (03CR) 10Yuvipanda: [C: 032 V: 032] paws_internal: Provision research users on notebook node [puppet] - 10https://gerrit.wikimedia.org/r/319479 (https://phabricator.wikimedia.org/T149543) (owner: 10Yuvipanda) [22:39:09] (03CR) 10Yuvipanda: [C: 032 V: 032] roles: Kill the 'notebook' roles [puppet] - 10https://gerrit.wikimedia.org/r/319476 (owner: 10Yuvipanda) [22:40:07] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [22:42:07] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:45:07] RECOVERY - 
check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 213 seconds ago with 0 failures [22:45:32] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/319447/ (duration: 00m 47s) [22:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:45] yurik, ^ [22:48:36] (03PS4) 10BBlack: prometheus: add varnish_exporter to all varnishes [puppet] - 10https://gerrit.wikimedia.org/r/319295 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [22:48:44] (03CR) 10BBlack: [C: 032 V: 032] prometheus: add varnish_exporter to all varnishes [puppet] - 10https://gerrit.wikimedia.org/r/319295 (https://phabricator.wikimedia.org/T147424) (owner: 10Ema) [22:49:53] (03PS1) 10Dzahn: mgmt: fix-up grep regex in getmgmtips [puppet] - 10https://gerrit.wikimedia.org/r/319483 [22:51:16] (03PS2) 10Dzahn: mgmt: fix-up grep regex in getmgmtips [puppet] - 10https://gerrit.wikimedia.org/r/319483 [22:51:32] (03PS3) 10Filippo Giunchedi: Initial commit [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) [22:51:51] (03CR) 10Dzahn: [C: 032] mgmt: fix-up grep regex in getmgmtips [puppet] - 10https://gerrit.wikimedia.org/r/319483 (owner: 10Dzahn) [22:51:53] (03CR) 10Filippo Giunchedi: "Initial sketch, tests and other endpoints from admin interface are TODO" [software/hhvm_exporter] - 10https://gerrit.wikimedia.org/r/319477 (https://phabricator.wikimedia.org/T147423) (owner: 10Filippo Giunchedi) [22:53:46] MaxSem, looks good to me [22:53:56] wee [22:57:52] (03PS3) 10Dzahn: mgmt: fix-up grep regex in getmgmtips [puppet] - 10https://gerrit.wikimedia.org/r/319483 [22:58:25] (03CR) 10Dzahn: [C: 032] "This fixes that host names with a "-" in them were not working." 
[puppet] - 10https://gerrit.wikimedia.org/r/319483 (owner: 10Dzahn) [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161102T2300). [23:00:04] James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:09] * James_F waves. [23:01:06] anybody who knows wikidata on this swat? [23:02:00] aude: ping? [23:02:19] addshore: ping? [23:02:25] SMalyshev: them I'd say ^ [23:02:50] Dereckson: yeah thanks :) the question would be whether they are around :) [23:03:02] !leroy [23:03:35] (03CR) 10Dzahn: [V: 032] mgmt: fix-up grep regex in getmgmtips [puppet] - 10https://gerrit.wikimedia.org/r/319483 (owner: 10Dzahn) [23:03:52] SMalyshev: I've asked on #wikidata, but they are generally more reactive at UE hours [23:03:57] EU [23:05:18] SMalyshev: issue is you need someone to test your fix? [23:05:45] Dereckson: rather, I need someone to deploy it, and I have no idea how wikidata picks are deployed [23:06:15] so I wonder whether somebody here running the SWAT does :) [23:06:49] SMalyshev: I've deployed one before, but it was from a change from aude. [23:07:21] Dereckson: so you know how they are deployed? e.g. how the build works, etc.? [23:08:43] https://gerrit.wikimedia.org/r/#/c/319401/ seems a new code, not a fix by the way [23:08:57] wikidata is different than all the other extensions in that it is not branched with the other repos, they keep their own pace and they have a special version that we use as a submodule for core. I don't know how they tag/branch each version, but this is some background.
[23:09:06] Dereckson: it's not a fix, it's a maintenance script [23:10:05] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:11] SMalyshev: well, SWAT is for config and fixes, not new features [23:10:34] Dereckson: so how you deploy maintenance script? [23:10:53] with the train for other extensions, during the Wikidata windows for that [23:10:55] (03PS1) 10Dzahn: add forward DNS for papaul-laptop.mgmt [dns] - 10https://gerrit.wikimedia.org/r/319490 [23:11:10] Question is there anyway i can volunteer to help out with deployments? [23:11:11] (hmmmm they don't have that anymore) [23:11:54] SMalyshev: I'll ask aude to look at your patches tomorrow morning if you wish, and ask how/when it could be deployed [23:12:32] SMalyshev: excepted if there is any emergency, like a need to run this script right now, in that case, we can look at that now [23:12:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:12:51] Dereckson: no, don't need it right now, though I'd really want to do it this week [23:13:07] since I'm on PTO next week and I wanted to wrap up this thing before I go [23:13:13] * Dereckson nods. 
[23:13:39] OK, I'll move it to morning one tomorrow and try to wake up too by then :) [23:13:56] PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:05] PROBLEM - Host cp3041 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:06] PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:06] Request from via cp3007 cp3007, Varnish XID 118344883 [23:14:06] Error: 503, Backend fetch failed at Wed, 02 Nov 2016 23:13:20 GMT [23:14:14] Yup [23:14:16] PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:17] I just pinged people [23:14:21] Commons is down :( [23:14:25] RECOVERY - Host cp3030 is UP: PING OK - Packet loss = 0%, RTA = 119.58 ms [23:14:25] RECOVERY - Host cp3041 is UP: PING OK - Packet loss = 0%, RTA = 119.67 ms [23:14:26] RECOVERY - Host cp3047 is UP: PING OK - Packet loss = 0%, RTA = 119.54 ms [23:14:26] RECOVERY - Host cp3042 is UP: PING OK - Packet loss = 0%, RTA = 119.80 ms [23:14:26] in the EU, probably [23:14:36] looks like some kind of network flap? [23:14:38] Nice one Reedy [23:14:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:14:47] James_F: let's wait a little bit the current issue is solved, and I take care of your config change [23:15:06] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [23:15:06] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 208 seconds ago with 0 failures [23:15:07] OK [23:15:09] <_joe_> bblack: looks like it [23:15:15] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [23:15:19] <_joe_> should we depool esams? [23:15:34] <_joe_> (enwiki was working fine for me) [23:15:35] Looks like it's bck? 
[23:15:35] Seems to be beack now [23:15:37] back* [23:15:45] wikipedia is down [23:15:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:15:45] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 119.71 ms [23:15:49] Krenair: lol that we both typo'ed on it [23:15:54] oh, you know it already [23:16:01] <_joe_> matanya: still "down"? [23:16:06] flappy [23:16:07] I'm not sure yet [23:16:08] back since eeden is back? [23:16:15] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 120.71 ms [23:16:19] <_joe_> loads fine for me [23:16:23] seems more up than down for me [23:16:26] out of 5 requests got 3 [23:16:43] <_joe_> matanya: there was an issue that *seems* recovered [23:16:55] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:16:56] it seems consistently up for me [23:16:59] <_joe_> but we have no idea yet what it was of if it's over [23:16:59] yes, new refreshes all come ok [23:17:08] right [23:17:15] PROBLEM - salt-minion processes on cp3012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:17:19] so esams internal networking temporarily broke? [23:17:25] unlikely [23:17:34] <_joe_> bblack: puppet failures seem to support the idea of a network connection issue esams-eqiad [23:17:35] more likely we had some kind of flap on our esams->eqiad backhaul [23:17:40] esams-eqiad link then? [23:17:42] but no router alert [23:17:42] yeah [23:17:44] <_joe_> and the salt minion failure too [23:17:49] checking librenms [23:18:17] my machines connected to ulsfo didn't have a problem, only those connected to esams [23:18:30] <_joe_> there is definitely a 3-minute hole in ganglia [23:18:31] didn't check eqiad [23:18:52] <_joe_> matanya: it's just esams [23:18:52] What about bastion? [23:19:17] What about bastion? 
[23:19:20] _joe_: from what i can tell, yes [23:19:25] Is it still up [23:19:44] Its giving me publickey error [23:19:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:19:47] I didn't loose any connections to labs/tool labs [23:20:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [23:21:05] Dereckson, are you doing the swat today? [23:21:26] yes, but let's wait a few minutes [23:21:37] this seems to be fallout from Level3's issues today [23:21:39] Zppix|mobile, it's supposed to give you public key errors [23:21:48] Dereckson, no worries, just wanted to add https://gerrit.wikimedia.org/r/#/c/319493/ - i will post it to depl page [23:21:52] the link in question is a wave we lease from them [23:22:01] yurik: is it possible for me to volunteer to do deployments [23:22:04] you don't have any kind of production shell access, bast3001 won't let you in [23:22:06] bblack: "#10665: 2016-11-02 Scheduled Utility Switchboard Maintenance at the DC6 IBX" maybe? [23:22:07] no Zppix|mobile [23:22:12] No Krenair to even access stuff [23:22:20] Zppix|mobile, yes, but only if you are doing it from mobile phone [23:22:30] yurik: ok [23:22:34] yurik: ? [23:22:43] your nick has |mobile :) [23:22:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [23:23:00] yurik: yes i know but im confused [23:23:14] RECOVERY - salt-minion processes on cp3012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:23:20] Zppix|mobile: A) yurik has no power over this and B) I haven't seen the requisite need/experience/knowledge [23:23:24] one more 503 out of several requests [23:24:28] sorry Zppix|mobile, i shouldn't joke in prod channel :( [23:24:48] Its ok yurik i didnt even know what you meant [23:25:11] bummer. 
it was not even a funny joke :( [23:25:45] Dereckson, jgirault is also on that patch, ping either one of us [23:25:52] i added it to the depl [23:25:55] * Dereckson nods [23:25:58] So everyone: when there's a current outage situation going on jokes and or side conversations/requests are completely offtopic and unwanted. [23:26:59] so far it looks like it was a very brief network flap for esams<->eqiad backhaul traffic [23:26:59] _joe_: i am quite sure it is 100% over from user perspective [23:27:08] but still investigating [23:27:27] <_joe_> matanya: I think bblack is going to investigate, while I head to bed :P [23:27:29] ("very brief" meaning < ~5 minutes, I don't know how much less) [23:27:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:29:18] (PS1) Yuvipanda: jupyterhub: Setup HTTP Proxy for each spawned node [puppet] - https://gerrit.wikimedia.org/r/319494 (https://phabricator.wikimedia.org/T149543) [23:29:59] (PS2) Yuvipanda: jupyterhub: Setup HTTP Proxy for each spawned node [puppet] - https://gerrit.wikimedia.org/r/319494 (https://phabricator.wikimedia.org/T149543) [23:30:12] _joe_: i should do the same, thanks folks, one day we will be able to use https://www.wikipediap2p.org/ without an extension [23:31:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:31:44] (PS1) Dzahn: remove db1027 remnants (reverse lookup) [dns] - https://gerrit.wikimedia.org/r/319495 [23:32:29] (PS2) Dzahn: remove db1027 remnants (reverse lookup) [dns] - https://gerrit.wikimedia.org/r/319495 (https://phabricator.wikimedia.org/T135253) [23:33:26] Down again [23:33:31] and now completely done bblack [23:33:55] PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:58] *down [23:34:03] me too [23:34:12] down [23:34:13] bleh [23:34:14] RECOVERY - Host cp3049 is UP: PING OK - Packet loss = 0%, RTA =
119.64 ms [23:34:44] only hitting cp3042 on all requests [23:35:11] EU isn't technically right... [23:35:13] But meh [23:35:14] PROBLEM - salt-minion processes on cp3012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:35:40] "Italy + other countries over there" then :p [23:35:56] and now partially back [23:35:56] seems to be back now [23:36:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:37:24] Operations, Traffic: 503 errors for users connecting to esams - https://phabricator.wikimedia.org/T149865#2766895 (Matanya) [23:37:25] (CR) Yuvipanda: [C: 2 V: 2] jupyterhub: Setup HTTP Proxy for each spawned node [puppet] - https://gerrit.wikimedia.org/r/319494 (https://phabricator.wikimedia.org/T149543) (owner: Yuvipanda) [23:37:38] !log disabling port xe-0/1/3 on cr2-esams (wave to eqiad, level3) [23:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:03] matanya: I've seen a "Wikipedia Foundation" on the website, proposed https://github.com/guerrerocarlos/WikipediaP2P.org/pull/1 to fix that [23:38:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:39:04] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [23:39:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:40:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/1/3: down - Core: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave] [23:40:04] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [23:40:59] nice Dereckson :) [23:44:04] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [23:44:57] !log disabling port xe-4/1/3 on cr2-eqiad (wave to esams, level3, other side of earlier disable) [23:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 [23:45:04] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 200 seconds ago with 0 failures [23:46:34] Operations, RESTBase, Traffic, Services (doing): Restbase redirects with cors not working on Android 4 native browser - https://phabricator.wikimedia.org/T149295#2766938 (Pchelolo) Created a PR to fix it, after review will test it in labs, hopefully it's gonna fix everything: https://github.com/w... [23:47:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:49:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:49:59] Dereckson: So, SWAT off and I should re-propose for the morning?
:-( [23:50:56] James_F: I wait a green light for bblack when/if the issue is stabilized [23:51:01] from [23:52:33] we seem to be stable [23:52:51] Dereckson: pong [23:53:18] * addshore wasn't here for swat [23:53:21] RECOVERY - salt-minion processes on cp3012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:51] addshore: SMalyshev would like to know how it could get Wikidata maintenance scripts deployed: https://gerrit.wikimedia.org/r/#/c/319401/ [23:54:07] Dereckson: I'm already talking with aude :) [23:54:17] as a part of the train now I guess? [23:54:20] so it's probably covered :) [23:55:17] =] [23:55:32] * addshore goes back to writing lists [23:59:05] yurik: jgirault: 'Sets font size to 14px for both static and interactive maps' is live on mw1099