[00:18:58] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[02:30:03] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.11) (duration: 10m 16s)
[02:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:51:00] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.12) (duration: 07m 56s)
[02:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:42] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Aug 7 02:57:42 UTC 2017 (duration 6m 42s)
[02:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 817.30 seconds
[03:52:08] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 297.06 seconds
[04:38:57] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:08:17] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:20:17] Operations, ops-eqiad, DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3505078 (Marostegui) The BBU is failing again, so we should try to give m1 master failover some priority amongst the other misc services.
[06:20:46] !log Force BBU re-learn on db1016 - T166344
[06:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:00] T166344: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344
[06:22:27] (PS1) Marostegui: Revert "db-codfw.php: Depool db2073" [mediawiki-config] - https://gerrit.wikimedia.org/r/370438
[06:22:30] (PS2) Marostegui: Revert "db-codfw.php: Depool db2073" [mediawiki-config] - https://gerrit.wikimedia.org/r/370438
[06:23:57] (PS1) Giuseppe Lavagetto: puppetmaster::puppetdb::client: fix dependencies. [puppet] - https://gerrit.wikimedia.org/r/370439 (https://phabricator.wikimedia.org/T172547)
[06:24:30] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2073" [mediawiki-config] - https://gerrit.wikimedia.org/r/370438 (owner: Marostegui)
[06:25:57] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2073" [mediawiki-config] - https://gerrit.wikimedia.org/r/370438 (owner: Marostegui)
[06:26:07] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2073" [mediawiki-config] - https://gerrit.wikimedia.org/r/370438 (owner: Marostegui)
[06:27:09] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2073 - T171321 (duration: 00m 47s)
[06:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:22] T171321: Finish dbstore2002 migration to multi-instance - https://phabricator.wikimedia.org/T171321
[06:28:10] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster::puppetdb::client: fix dependencies. [puppet] - https://gerrit.wikimedia.org/r/370439 (https://phabricator.wikimedia.org/T172547) (owner: Giuseppe Lavagetto)
[06:29:08] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[06:29:50] Operations, ops-eqiad, DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3505090 (Marostegui) After forcing the relearn, this recovered: ``` ˜/icinga-wm 8:29> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ```
[06:30:57] (PS1) Marostegui: db-codfw.php: Depool db2074 [mediawiki-config] - https://gerrit.wikimedia.org/r/370440 (https://phabricator.wikimedia.org/T171321)
[06:32:13] Puppet, Cloud-VPS, Patch-For-Review: ::profile::puppetmaster::common missing dependencies when $storeconfigs=puppetdb - https://phabricator.wikimedia.org/T172547#3505092 (Joe) Open>Resolved
[06:33:00] !log Stop replication on db2075 - T170662
[06:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:12] T170662: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662
[06:33:56] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2074 [mediawiki-config] - https://gerrit.wikimedia.org/r/370440 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[06:35:21] (Merged) jenkins-bot: db-codfw.php: Depool db2074 [mediawiki-config] - https://gerrit.wikimedia.org/r/370440 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[06:36:10] (CR) jenkins-bot: db-codfw.php: Depool db2074 [mediawiki-config] - https://gerrit.wikimedia.org/r/370440 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[06:37:39] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2074 - T171321 (duration: 00m 46s)
[06:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:51] T171321: Finish dbstore2002 migration to multi-instance - https://phabricator.wikimedia.org/T171321
[06:38:26] !log Stop MySQL on db2074 - T171321
[06:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:44] (PS1) Marostegui: mariadb: Add s3 to dbstore2002 [puppet] - https://gerrit.wikimedia.org/r/370441 (https://phabricator.wikimedia.org/T171321)
[06:45:36] (CR) Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/7314/" [puppet] - https://gerrit.wikimedia.org/r/370441 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[06:45:41] (CR) Marostegui: [C: 2] mariadb: Add s3 to dbstore2002 [puppet] - https://gerrit.wikimedia.org/r/370441 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[06:52:33] (PS1) Marostegui: db-codfw.php: Depool db2065 [mediawiki-config] - https://gerrit.wikimedia.org/r/370442
[06:59:35] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2065 [mediawiki-config] - https://gerrit.wikimedia.org/r/370442 (owner: Marostegui)
[07:01:05] (Merged) jenkins-bot: db-codfw.php: Depool db2065 [mediawiki-config] - https://gerrit.wikimedia.org/r/370442 (owner: Marostegui)
[07:01:15] (CR) jenkins-bot: db-codfw.php: Depool db2065 [mediawiki-config] - https://gerrit.wikimedia.org/r/370442 (owner: Marostegui)
[07:02:16] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2065 to reimport: page, linter and watchlist tables (duration: 00m 47s)
[07:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:37] !log Stop replication on db2065 to reimport: page, linter and watchlist tables
[07:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:30] Operations, ops-eqiad, Analytics, User-Elukey: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3505104 (elukey) Tried this: * ifdown eth0 * modprobe -r tg3 * modprobe tg3 * ifup eth0 ``` [Mon Aug 7 07:28:13 2017] pps_core: LinuxPPS...
[08:09:08] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[08:11:22] Again..
[08:11:54] Operations, ops-eqiad, DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3505175 (Marostegui) And again: `˜/icinga-wm 10:09> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough`
[08:12:01] !log Force BBU re-learn on db1016 - T166344
[08:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:15] T166344: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344
[08:13:20] Operations, ops-eqiad, DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3293244 (jcrespo) Maybe we can setup m1 on db1069?
[08:18:27] Operations, ops-eqiad, DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3505184 (Marostegui) >>! In T166344#3505178, @jcrespo wrote: > Maybe we can setup m1 on db1069? I like that idea, I'll try to work on: T166546 soon as I am about to finish with: T153743
[08:28:59] Operations, ops-eqiad, DC-Ops, Discovery-Search (Current work): some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816#3505200 (Gehel) elastic1017-1031 have had thermal paste applied. Looking at [[ https://grafana.wikimedia.org/dashboard/db/prometheus...
[08:30:47] Is there anyone available here that can help me log in to Phabricator?
[08:31:06] I'm locked out due to no longer having the device that it wants an auth code from.
[08:37:57] Ouch
[08:38:00] Hey Deskana :D
[08:38:11] How are ya?
[08:45:40] (PS1) Marostegui: Revert "db-codfw.php: Depool db2065" [mediawiki-config] - https://gerrit.wikimedia.org/r/370444
[09:00:41] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2065" [mediawiki-config] - https://gerrit.wikimedia.org/r/370444 (owner: Marostegui)
[09:02:05] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2065" [mediawiki-config] - https://gerrit.wikimedia.org/r/370444 (owner: Marostegui)
[09:02:19] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2065" [mediawiki-config] - https://gerrit.wikimedia.org/r/370444 (owner: Marostegui)
[09:03:53] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2065 after fixing: linter, page and watchlist tables (duration: 00m 47s)
[09:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:33] !log set net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 (was 120) on all the analytics kafka brokers - T136094
[09:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:44] T136094: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094
[09:07:06] Operations, Patch-For-Review, User-Elukey: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094#3505255 (elukey)
[09:17:21] (PS1) Giuseppe Lavagetto: role::puppet_compiler: bind ssl to 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/370445
[09:37:35] (PS1) Jcrespo: mariadb: Pool db1098 as new s8 recentchanges/watchlist host [puppet] - https://gerrit.wikimedia.org/r/370447 (https://phabricator.wikimedia.org/T172679)
[09:42:34] (PS2) Jcrespo: mariadb: Pool db1098 as new s6 recentchanges/watchlist host [puppet] - https://gerrit.wikimedia.org/r/370447 (https://phabricator.wikimedia.org/T172679)
[09:47:45] !log stopping db1050's mysql and cloning it to db1089
[09:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:08] (PS1) Marostegui: s3.hosts: dbstore2002 is now replicating s3 [software] - https://gerrit.wikimedia.org/r/370448 (https://phabricator.wikimedia.org/T171321)
[09:52:17] (PS1) Marostegui: mariadb: dbstore2002 has now 5 shards replicating [puppet] - https://gerrit.wikimedia.org/r/370449 (https://phabricator.wikimedia.org/T171321)
[09:52:24] jynus: ^
[09:52:53] (CR) Jcrespo: [C: 1] mariadb: dbstore2002 has now 5 shards replicating [puppet] - https://gerrit.wikimedia.org/r/370449 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[09:53:12] (CR) Marostegui: [C: 2] mariadb: dbstore2002 has now 5 shards replicating [puppet] - https://gerrit.wikimedia.org/r/370449 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[09:56:05] (CR) Marostegui: [C: 2] s3.hosts: dbstore2002 is now replicating s3 [software] - https://gerrit.wikimedia.org/r/370448 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[09:56:51] (Merged) jenkins-bot: s3.hosts: dbstore2002 is now replicating s3 [software] - https://gerrit.wikimedia.org/r/370448 (https://phabricator.wikimedia.org/T171321) (owner: Marostegui)
[09:59:24] (PS2) Giuseppe Lavagetto: role::puppet_compiler: bind ssl to 0.0.0.0 [puppet] - https://gerrit.wikimedia.org/r/370445
[10:02:14] !log Add dbstore2002:3313 to tendril - T171321
[10:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:26] T171321: Finish dbstore2002 migration to multi-instance - https://phabricator.wikimedia.org/T171321
[10:07:02] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7321/" [puppet] - https://gerrit.wikimedia.org/r/370445 (owner: Giuseppe Lavagetto)
[10:14:34] qchris: around?
[10:14:45] Operations, Analytics-Kanban, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3505470 (elukey)
[10:18:59] Operations, Analytics-Kanban, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3505486 (elukey)
[10:19:08] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[10:23:03] Operations, Analytics-Kanban, User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3505490 (elukey)
[10:23:34] (PS1) Giuseppe Lavagetto: puppet-compiler: fix typo [puppet] - https://gerrit.wikimedia.org/r/370451
[10:24:28] (CR) Giuseppe Lavagetto: [C: 2] puppet-compiler: fix typo [puppet] - https://gerrit.wikimedia.org/r/370451 (owner: Giuseppe Lavagetto)
[10:28:02] (CR) MarcoAurelio: Enable wgMinervaEnableSiteNotice for kowiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: Revi)
[10:30:12] Hi Reedy: do you think https://gerrit.wikimedia.org/r/#/c/370310/ is ready?
[10:40:18] (PS3) Revi: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630)
[10:41:52] (CR) Revi: Enable wgMinervaEnableSiteNotice for kowiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: Revi)
[10:42:10] err apple waych
[10:42:12] watch*
[10:42:47] (CR) MarcoAurelio: [C: 1] Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: Revi)
[10:42:54] :D
[10:43:23] :DD
[10:48:47] Volker_E: Yup. What's up?
[10:53:44] (CR) Thiemo Mättig (WMDE): [C: 1] mediawiki: Another increase of batch size in dispatchChanges cronjob [puppet] - https://gerrit.wikimedia.org/r/370315 (https://phabricator.wikimedia.org/T171263) (owner: Ladsgroup)
[11:17:58] (PS1) Ladsgroup: beta: Add copyright info for Wikidata API [mediawiki-config] - https://gerrit.wikimedia.org/r/370455 (https://phabricator.wikimedia.org/T112606)
[11:21:37] Operations, ops-eqiad, DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3505584 (Marostegui) ``` ˜/icinga-wm 12:19> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ```
[11:29:07] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[11:36:07] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[11:38:07] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[12:07:26] (PS12) Gehel: logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[12:08:04] !log deploying https://gerrit.wikimedia.org/r/#/c/299825/ - some logs will be lost during logstash restart
[12:08:08] (CR) Gehel: [C: 2] logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/299825 (owner: BryanDavis)
[12:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:13] _joe_: there is puppet compiler fixed not yet merged on puppetmaster1001. Ok if I merge it with my change?
[12:10:09] _joe_: it looks trivial enough, I'm merging it...
[12:15:37] (PS1) Marostegui: Revert "db-codfw.php: Depool db2074" [mediawiki-config] - https://gerrit.wikimedia.org/r/370457
[12:15:41] (PS2) Marostegui: Revert "db-codfw.php: Depool db2074" [mediawiki-config] - https://gerrit.wikimedia.org/r/370457
[12:22:16] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - logstash-syslog-tcp_10514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-json-tcp_11514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool s
[12:22:16] eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down!
[12:22:36] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - logstash-syslog-udp_10514_udp - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-tcp_11514 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depo
[12:22:36] 002.eqiad.wmnet because of too many down!: logstash-syslog-tcp_10514 - Could not depool server logstash1002.eqiad.wmnet because of too many down!
[12:22:46] ^that's me, rolling back ...
[12:22:56] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - logstash-syslog-tcp_10514 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-tcp_11514 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool s
[12:22:56] eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1002.eqiad.wmnet because of too many down!
[12:23:41] (PS1) Gehel: Revert "logstash: Parse nginx access logs for wdqs" [puppet] - https://gerrit.wikimedia.org/r/370460
[12:23:47] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2074" [mediawiki-config] - https://gerrit.wikimedia.org/r/370457 (owner: Marostegui)
[12:23:55] (CR) Gehel: [V: 2 C: 2] Revert "logstash: Parse nginx access logs for wdqs" [puppet] - https://gerrit.wikimedia.org/r/370460 (owner: Gehel)
[12:25:01] PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused
[12:25:19] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2074" [mediawiki-config] - https://gerrit.wikimedia.org/r/370457 (owner: Marostegui)
[12:25:30] now I need to understand why this is working fine on labs...
[12:26:05] <_joe_> gehel: yeah sorry
[12:26:20] _joe_: no problem :)
[12:26:23] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2074 - T171321 (duration: 00m 45s)
[12:26:23] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2074" [mediawiki-config] - https://gerrit.wikimedia.org/r/370457 (owner: Marostegui)
[12:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:35] T171321: Finish dbstore2002 migration to multi-instance - https://phabricator.wikimedia.org/T171321
[12:28:58] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[12:31:01] RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514
[12:31:06] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy
[12:31:06] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[12:33:36] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[12:34:20] * gehel is also having a look at elasticsearch slowing down...
[12:36:16] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused
[12:37:36] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[12:37:47] (PS1) Gehel: logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/370463
[12:38:37] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[12:39:11] (CR) Gehel: [C: 2] logstash: Parse nginx access logs for wdqs [puppet] - https://gerrit.wikimedia.org/r/370463 (owner: Gehel)
[12:39:18] !log restart kafka on kafka1018 to force it out of the kafka topic leaders - T172681
[12:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:29] <_joe_> !log restarting pdfrender on scb1001, T159922
[12:39:29] T172681: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681
[12:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:40] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922
[12:43:17] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[13:00:06] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170807T1300).
[13:00:06] stephanebisson, TabbyCat, revi, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:14] o/
[13:00:16] o/
[13:00:16] available
[13:00:19] hello
[13:02:10] (CR) Jcrespo: [C: 2] mariadb: Pool db1098 as new s6 recentchanges/watchlist host [puppet] - https://gerrit.wikimedia.org/r/370447 (https://phabricator.wikimedia.org/T172679) (owner: Jcrespo)
[13:02:17] (PS3) Jcrespo: mariadb: Pool db1098 as new s6 recentchanges/watchlist host [puppet] - https://gerrit.wikimedia.org/r/370447 (https://phabricator.wikimedia.org/T172679)
[13:04:48] * revi drinks his cola
[13:07:44] who's swatting today?
[13:08:03] * TabbyCat eyes aude
[13:09:12] I think wikimania season...
[13:09:16] seems errbody is there?
[13:10:53] Amir1: can you deploy?
[13:11:14] I can but I'm not an official SWATer
[13:11:32] is that a hard blocker?
[13:11:41] (not forcing you, just asking)
[13:11:45] The problem is, I don't know :D
[13:11:50] lol k
[13:12:19] revi: yep, WM is probably the reason; I think I saw something on Wikitech about that?
[13:12:34] Wikimania is Aug 9th through 13th
[13:12:38] MW Train will progress as normal
[13:12:38] Service and SWAT deploys will be on a best-effort basis (if anything to deploy)
[13:12:39] probably you saw this
[13:12:43] ^^
[13:12:49] Right after ==Week of August 7th==
[13:13:24] I think deployers would be available for murrican morning deploy and evening but that would be too late for me
[13:13:31] 03:00–04:00 UTC+9
[13:13:36] that's... no...no....
[13:13:44] * Amir1 facepalms
[13:13:54] I'm pinging releng
[13:14:07] I would be awake at 8AM (evening swat) but I have to fly for Wikimania at that time
[13:14:09] :P
[13:14:14] so now or Wed
[13:14:21] oh revi I'll see you there then
[13:14:28] :D
[13:14:33] (joke)
[13:14:38] telepathy
[13:15:15] !log reboot db1098
[13:15:22] <_joe_> yes, SWAT windows should be done by official swatters only without explicit permission or UBN! tickets
[13:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:56] Hello everyone, welcome to my first SWAT
[13:16:05] I hope we don't crash to anywhere :D
[13:16:08] so is he authorized?
[13:16:09] *clap* *clap*
[13:16:16] <_joe_> I have no idea :P
[13:16:23] lol
[13:16:29] I just asked in releng
[13:16:33] <_joe_> cool!
[13:16:42] * revi applauds
[13:17:14] revi: I prefer to start with you for timezone reasons
[13:17:23] thanks :D
[13:17:37] (CR) Ladsgroup: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: Revi)
[13:17:54] (CR) Ladsgroup: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: Revi)
[13:18:19] Wait a sec, I forgot my yubikey
[13:18:24] sure
[13:20:09] revi: TabbyCat , sorry, my yubikey is in hotel, I can't login to prod at all
[13:20:16] * revi :O
[13:20:34] that's fine, the task itself isn't urgent, I can do that @ montreal
[13:20:43] wonder if _joe_ can deploy
[13:20:47] preferably @hackathon
[13:21:03] I use a hardware key to login, I should keep it with myself all the time
[13:21:10] sorry
[13:21:45] no need to be sorry :D
[13:22:03] so I think today's european deploy is gone, rescheduling
[13:22:10] jynus: do you have permissions to deploy?
[13:22:16] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:22:35] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:22:45] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:22:45] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:22:46] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:22:51] revi: do you want me to schedule it for later SWAT today? I have something to deploy too, can babysit yours (If it doesn't require knowing Korean)
[13:22:52] my patch was not urgent, I'll let it ride the train this week
[13:22:55] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:23:05] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:23:05] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[13:23:17] Amir1: you just need to check if the same text in web sitenotice is visible on mobile too
[13:23:38] (mobile web)
[13:24:00] "향후 15년의 위키미디어의 미래를 논의하는 위키미디어 2030 전략 토론이 진행되고 있습니다." ?
[13:24:03] current kowiki sitenotice has English in it (Altostratus) so it should be possible to compare it
[13:24:06] two lines
[13:24:13] 향후 15년의 위키미디어의 미래를 논의하는 위키미디어 2030 전략 토론이 진행되고 있습니다.
[13:24:13] 사용자:Altostratus에 대한 관리자 선거가 2017년 8월 8일 (화) 14:09 (KST)까지 진행됩니다.
[13:24:43] okay
[13:24:50] That doesn't seem bad
[13:25:03] Maybe just check Altostratus, numbers, KST :P
[13:25:20] Is James_F also at Wikimania?
[13:25:42] https://wikimania2017.wikimedia.org/wiki/Template:Attendees/100 probably yes
[13:26:03] :D
[13:26:09] I guess this week it'll be complicated to do anything
[13:26:19] likely
[13:26:35] and tomorrow I have the phone company migrating my ADSL to fiber and it'll likely take the whole day
[13:30:41] Amir1: well, I guess we need to revert revi's patch now that it's merged?
[13:30:47] it wasn't
[13:30:54] TabbyCat: I didn't let it merge
[13:30:57] it's cr+2
[13:30:59] ah
[13:31:09] he removed +2 when he was looking for his key
[13:31:46] otoh Amir1 https://phabricator.wikimedia.org/T172641#3505730 <-- how should I check that?
[13:32:13] I mean, I use labels.wmflabs interface [13:32:33] TabbyCat: https://translatewiki.net/wiki/Special:Translate?action=translate&group=wiki-ai-wikilabels-form-dagf&language=es&filter=%21translated [13:33:04] okay, I'll do a quick review [13:33:05] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [13:33:47] Amir1: looks good to me [13:33:52] (03PS1) 10Jcrespo: mariadb: Add db1098 to the list of available s6 hosts [software] - 10https://gerrit.wikimedia.org/r/370465 (https://phabricator.wikimedia.org/T172679) [13:34:05] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:34:45] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [13:34:49] TabbyCat: So we need to wait that this gets to wikilabels [13:34:55] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [13:34:55] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:35:00] probably by tomorrow [13:35:05] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:35:05] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:35:06] RECOVERY - DPKG on stat1005 is OK: All packages OK [13:35:15] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [13:35:21] (03CR) 10Jcrespo: [C: 032] mariadb: Add db1098 to the list of available s6 hosts [software] - 10https://gerrit.wikimedia.org/r/370465 (https://phabricator.wikimedia.org/T172679) (owner: 10Jcrespo) [13:35:24] Amir1: alright, cool [13:35:41] probably when a translation is missing it should fallback to English [13:35:46] (03PS2) 10Andrew Bogott: toolschecker: use the new puppetmaster for manifest checks [puppet] - 
10https://gerrit.wikimedia.org/r/370251 (https://phabricator.wikimedia.org/T171786) [13:36:05] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [13:37:55] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:38:55] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [13:39:24] (03CR) 10Andrew Bogott: [C: 032] toolschecker: use the new puppetmaster for manifest checks [puppet] - 10https://gerrit.wikimedia.org/r/370251 (https://phabricator.wikimedia.org/T171786) (owner: 10Andrew Bogott) [13:40:36] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:40:55] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:41:55] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:41:56] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [13:41:56] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the 
unexpected status 404 (expecting: 200) [13:42:56] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [13:43:46] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [13:44:56] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:45:26] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:45:26] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:45:56] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:46:05] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [13:46:35] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [13:46:35] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [13:47:22] what was the issue with recommendation_api?
[13:48:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:48:16] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: further work for puppetdb support [puppet] - 10https://gerrit.wikimedia.org/r/370466 [13:48:29] Amir1: maybe... you could babysit my patches as well? [13:48:33] might be that the testing endpoints have been removed? [13:48:40] (200 --> 404) [13:48:44] I'll try to be around [13:48:55] TabbyCat: first, can you check the wikilabels for eswiki? [13:49:02] sure [13:49:05] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:49:06] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [13:49:06] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:49:06] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [13:49:09] elukey: like data not existing due to edits? 
[13:49:35] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:49:35] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:49:35] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:49:38] jynus: something like that, but it is only a speculation. Checking on the host [13:50:05] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [13:50:05] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [13:50:12] TabbyCat: Yeah sure [13:50:35] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [13:50:36] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [13:50:36] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [13:51:34] https://phabricator.wikimedia.org/F8980254 <-- Amir1 now I see this [13:51:56] TabbyCat: yeah, that was the plan [13:52:04] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3505831 (10Papaul) @elukey do you have any log for me? 
[13:52:06] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [13:52:12] oh argh [13:52:16] vandals on wikitech [13:52:30] * TabbyCat heads for the broom [13:52:56] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [13:53:55] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [13:54:24] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: further work for puppetdb support [puppet] - 10https://gerrit.wikimedia.org/r/370466 (owner: 10Giuseppe Lavagetto) [13:54:33] (03PS2) 10Giuseppe Lavagetto: puppet-compiler: further work for puppetdb support [puppet] - 10https://gerrit.wikimedia.org/r/370466 [13:56:16] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [13:56:25] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:56:26] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:56:43] I am on it --^ [13:56:55] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:57:05] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:57:05] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:57:15] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:57:16] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [13:57:25] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [14:01:35] o/ _joe_ [14:01:51] Do you have some time today to look at ORES stress tests with 
me :) [14:03:09] Related, if anyone could give me a review of https://gerrit.wikimedia.org/r/#/c/369915/, I could probably continue on my own. [14:03:20] Not quite sure I've done that the right way. [14:03:35] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [14:03:55] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [14:04:05] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Mon 2017-08-07 14:03:56 UTC. [14:04:09] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [14:04:09] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:04:16] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [14:04:16] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3505857 (10Halfak) Looks like we have missed the scheduled time. I'm just waiting on review of the above patchset so that I can continue testi... 
[14:04:25] RECOVERY - DPKG on stat1005 is OK: All packages OK [14:04:25] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [14:13:29] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: add environment to exec [puppet] - 10https://gerrit.wikimedia.org/r/370468 [14:13:32] 10Operations, 10Mail: Install missing Spamassassin DKIM dependencies on lists and mx - https://phabricator.wikimedia.org/T172689#3505877 (10herron) [14:13:59] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] puppet-compiler: add environment to exec [puppet] - 10https://gerrit.wikimedia.org/r/370468 (owner: 10Giuseppe Lavagetto) [14:21:09] 10Operations, 10Mail: Install missing Spamassassin DKIM dependencies on lists and mx - https://phabricator.wikimedia.org/T172689#3505954 (10herron) Installed libmail-dkim-perl and restarted spamassassin service fermium:~# spamassassin -D --lint dbg: diag: [...] module installed: Mail::DKIM, version 0.4... [14:21:18] (03PS2) 10Andrew Bogott: shinken: test the new labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/370252 (https://phabricator.wikimedia.org/T171786) [14:22:09] !log mx[1,2]001, fermium: Installed libmail-dkim-perl and restarted spamassassin service - T172689 [14:22:17] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3505957 (10Andrew) [14:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:19] T172689: Install missing Spamassassin DKIM dependencies on lists and mx - https://phabricator.wikimedia.org/T172689 [14:22:23] (03CR) 10Andrew Bogott: [C: 032] shinken: test the new labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/370252 (https://phabricator.wikimedia.org/T171786) (owner: 10Andrew Bogott) [14:22:53] 10Operations, 10Mail: Install missing Spamassassin DKIM dependencies on lists and mx - 
https://phabricator.wikimedia.org/T172689#3505961 (10herron) 05Open>03Resolved [14:25:14] 10Operations, 10Discovery, 10Discovery-Analysis, 10Maps, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3505968 (10Gehel) Over the last 30 days, backend requests [[ https://grafana-admin.wikimedia.org/dashboard/db/maps-performances?panelId=4&fullscr... [14:26:14] 10Operations, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3505969 (10Andrew) 05Open>03Resolved This is up and working. [14:28:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:32:11] !log phab2001 - stopping Apache,schedule downtime for http and puppet [14:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:25] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:34:35] RECOVERY - Disk space on stat1005 is OK: DISK OK [14:35:41] 10Operations, 10Domains, 10Traffic, 10Wikimedia Resource Center, 10Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3505985 (10mcruzWMF) >>! In T172417#3504034, @Reedy wrote: > I don't disagree with Timo above, and I'm guessing #operations will ag... 
[14:36:25] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:38:21] (03PS1) 10Andrew Bogott: wikitech-static monitoring: check much less frequently [puppet] - 10https://gerrit.wikimedia.org/r/370472 (https://phabricator.wikimedia.org/T168962) [14:38:35] !log updated librdkafka1 and ++1 to 0.9.4.1 on hafnium [14:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:02] (03CR) 10Andrew Bogott: [C: 032] wikitech-static monitoring: check much less frequently [puppet] - 10https://gerrit.wikimedia.org/r/370472 (https://phabricator.wikimedia.org/T168962) (owner: 10Andrew Bogott) [14:39:26] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [14:41:34] 10Operations, 10Cloud-Services, 10Patch-For-Review: wikitech-static sync check shouldn't happen so often - https://phabricator.wikimedia.org/T168962#3506019 (10Andrew) 05Open>03Resolved [14:45:35] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:45:57] 10Operations, 10Wikimania-Hackathon-2017-Organization, 10Release-Engineering-Team (Watching / External): Wikimania needs hosting on a server for onsite conference guide - https://phabricator.wikimedia.org/T172217#3506028 (10Dzahn) @Antoine2711 Is this working for you? Any issues? I will be in travel to Wikim... 
[14:51:26] !log reducing elasticsearch eqiad concurrent rebalance to 4 (from 8) [14:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:47] (03PS1) 10Andrew Bogott: increase retries for check_nova_compute_process [puppet] - 10https://gerrit.wikimedia.org/r/370474 (https://phabricator.wikimedia.org/T171606) [14:55:04] (03CR) 10Andrew Bogott: [C: 032] increase retries for check_nova_compute_process [puppet] - 10https://gerrit.wikimedia.org/r/370474 (https://phabricator.wikimedia.org/T171606) (owner: 10Andrew Bogott) [14:58:06] 10Operations, 10Domains, 10Traffic, 10Wikimedia Resource Center, 10Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3506077 (10mcruzWMF) @Reedy @Krinkle Would it be possible to implement this by tomorrow (Tuesday August 8), because if so we would... [14:59:31] (03PS1) 10Ottomata: Allow rsync to dataset1001 for pagecounts-ez [puppet] - 10https://gerrit.wikimedia.org/r/370478 (https://phabricator.wikimedia.org/T152712) [15:00:41] (03CR) 10Ottomata: [C: 032] Allow rsync to dataset1001 for pagecounts-ez [puppet] - 10https://gerrit.wikimedia.org/r/370478 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [15:07:06] (03CR) 10Jcrespo: "These are truly awful grants- We will get rid of most of these, but we need time." [puppet] - 10https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [15:08:14] (03PS1) 10Jcrespo: mariadb: Add db1098 as new s6 recentchanges/watchlist/... replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370480 (https://phabricator.wikimedia.org/T171027) [15:10:59] !log restarting jenkins for plugin upgrade [15:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:02] (03PS2) 10Jcrespo: mariadb: Add db1098 as new s6 recentchanges/watchlist/... 
replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370480 (https://phabricator.wikimedia.org/T171027) [15:14:39] (03PS3) 10Jcrespo: mariadb: Add db1098 as new s6 recentchanges/watchlist/... replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370480 (https://phabricator.wikimedia.org/T171027) [15:15:52] (03PS4) 10Jcrespo: mariadb: Add db1098 as new s6 recentchanges/watchlist/... replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370480 (https://phabricator.wikimedia.org/T171027) [15:30:33] (03PS1) 10Jgreen: unsubscribe awight from fr-tech icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/370483 [15:31:14] (03PS2) 10Jgreen: unsubscribe awight from fr-tech icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/370483 (https://phabricator.wikimedia.org/T170437) [15:31:49] (03CR) 10jerkins-bot: [V: 04-1] unsubscribe awight from fr-tech icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/370483 (https://phabricator.wikimedia.org/T170437) (owner: 10Jgreen) [15:35:27] (03PS3) 10Jgreen: unsubscribe awight from fr-tech icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/370483 (https://phabricator.wikimedia.org/T170437) [15:36:13] (03CR) 10Jgreen: [C: 032] unsubscribe awight from fr-tech icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/370483 (https://phabricator.wikimedia.org/T170437) (owner: 10Jgreen) [15:49:08] 10Operations, 10Analytics-Kanban, 10User-Elukey: Analytics Kafka cluster causing timeouts to Varnishkafka since July 28th - https://phabricator.wikimedia.org/T172681#3506357 (10elukey) From https://apache.googlesource.com/kafka/+/refs/heads/trunk/clients/src/main/java/org/apache/kafka/common/protocol/ApiKeys... 
[15:49:53] (03PS1) 10Herron: Add SPF and DKIM perl package requires to spamassassin class [puppet] - 10https://gerrit.wikimedia.org/r/370487 (https://phabricator.wikimedia.org/T172689) [15:51:19] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [15:51:24] (03PS5) 10Gehel: wdqs - remove upstart configuration files [puppet] - 10https://gerrit.wikimedia.org/r/369688 [15:52:08] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [15:52:28] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [15:55:15] (03PS1) 10Marostegui: realm.pp: Add to oauth tables to the private list [puppet] - 10https://gerrit.wikimedia.org/r/370489 (https://phabricator.wikimedia.org/T172693) [15:56:55] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10User-Elukey: Analytics1034 eth0 negotiated speed to 100Mb/s instead of 1000Mb/s - https://phabricator.wikimedia.org/T172633#3506395 (10Nuria) [15:56:59] (03CR) 10Jcrespo: [C: 031] realm.pp: Add to oauth tables to the private list [puppet] - 10https://gerrit.wikimedia.org/r/370489 (https://phabricator.wikimedia.org/T172693) (owner: 10Marostegui) [15:57:36] (03CR) 10Jcrespo: [C: 031] realm.pp: Add to oauth tables to the private list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370489 (https://phabricator.wikimedia.org/T172693) (owner: 10Marostegui) [15:58:20] (03PS2) 10Marostegui: realm.pp: Add two oauth tables to the private list [puppet] - 10https://gerrit.wikimedia.org/r/370489 (https://phabricator.wikimedia.org/T172693) [16:00:06] (03CR) 10Marostegui: [C: 032] realm.pp: Add two oauth tables to the private list [puppet] - 10https://gerrit.wikimedia.org/r/370489 
(https://phabricator.wikimedia.org/T172693) (owner: 10Marostegui) [16:06:18] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2033659 [16:06:18] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] Ayounsi https://phabricator.wikimedia.org/T169498 [16:08:24] 10Operations, 10Analytics, 10Research: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410#3497641 (10Nuria) Not sure what do we need to do here. What is on analytics store apart from eventlogging and mediawiki databases? [16:08:51] (03CR) 10Gehel: [C: 04-1] wdqs - remove upstart configuration files [puppet] - 10https://gerrit.wikimedia.org/r/369688 (owner: 10Gehel) [16:12:02] <_joe_> gehel: tell me when should I take a look at those changes, btw [16:12:18] _joe_: yep, I'll ping you when ready! [16:12:35] <_joe_> thanks for working on that :) [16:12:48] my pleasure (well, to some extent...)
[16:12:53] <_joe_> eheh [16:13:24] <_joe_> I hope that by thursday we'll also have full puppetdb support in the compiler [16:14:58] (03CR) 10Dzahn: "our only blocker here is currently that phab fails to create the weekly stats mail with "ERROR 1698 (28000): Access denied for user 'phsta" [puppet] - 10https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [16:20:57] (03CR) 10Paladox: [C: 031] mariadb/phabricator: update GRANTS from iridium to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [16:24:18] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [16:31:19] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [16:33:23] 10Operations, 10Discovery, 10Discovery-Analysis, 10Maps, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3506554 (10BBlack) It's just per-IP. So yes that sounds fine: if you're peaking at 80/s total, then lets put an upper sanity bound at 100/s miss... [16:34:42] (03PS1) 10EBernhardson: Enable max token count for phrase rescore on zh lang wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370497 (https://phabricator.wikimedia.org/T169498) [16:36:23] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [16:36:50] (03CR) 10Awight: [C: 04-1] Adds hieradata for ores::celery::workers with default. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) (owner: 10Halfak) [16:37:03] !log manually restarted varnish on cp1099 [16:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:39] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] Ayounsi T169498 [16:39:03] 10Operations, 10Domains, 10Traffic, 10Wikimedia Resource Center, 10Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3506569 (10Reedy) It's not on either of us, at this point, it's on #operations to do the review/merging/deployment Though, as you... [16:39:28] (03PS1) 10Dzahn: phabricator: open firewall holes only on active_server [puppet] - 10https://gerrit.wikimedia.org/r/370498 [16:39:52] (03CR) 10jerkins-bot: [V: 04-1] phabricator: open firewall holes only on active_server [puppet] - 10https://gerrit.wikimedia.org/r/370498 (owner: 10Dzahn) [16:40:13] (03Abandoned) 10Reedy: Add resources.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/369971 (https://phabricator.wikimedia.org/T172417) (owner: 10Reedy) [16:41:25] XioNoX: thanks for the ack! we should have a temporary fix deployed in a few hours... 
[16:44:50] !log Restart s7 instance on db1069 to pick up new replication filters - T172693 [16:45:00] (03PS2) 10Reedy: Redirect wikimedia.org/resources to meta [puppet] - 10https://gerrit.wikimedia.org/r/369970 (https://phabricator.wikimedia.org/T172417) [16:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:43] (03CR) 10Reedy: "PS2 swaps it to something specific after the / and rebases it" [puppet] - 10https://gerrit.wikimedia.org/r/369970 (https://phabricator.wikimedia.org/T172417) (owner: 10Reedy) [16:45:45] (03PS2) 10Gilles: Serve a synth error page when error body is empty in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/365589 (https://phabricator.wikimedia.org/T169683) [16:46:10] 10Operations, 10Phabricator, 10Traffic, 10Release-Engineering-Team (Kanban): Verify that the codfw lvs is configured correctly for Phabricator - https://phabricator.wikimedia.org/T168699#3506599 (10mmodell) phab2001 web works, git-ssh still unknown. [16:46:20] (03PS2) 10Dzahn: phabricator: open firewall holes only on active_server [puppet] - 10https://gerrit.wikimedia.org/r/370498 [16:46:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please use require_package instead" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/370487 (https://phabricator.wikimedia.org/T172689) (owner: 10Herron) [16:47:59] (03CR) 10Jcrespo: [C: 04-1] "I will leave db1098 partitioning overnight, and maybe it can be pooled tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/370480 (https://phabricator.wikimedia.org/T171027) (owner: 10Jcrespo) [16:48:09] (03PS3) 10Dzahn: phabricator: open firewall holes only on active_server [puppet] - 10https://gerrit.wikimedia.org/r/370498 [16:48:26] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3506604 (10mmodell) [16:50:07] <_joe_> Reedy: that would only work for wikimedia.org/resources [16:50:14] <_joe_> not for www.wikimedia.org/resources [16:50:20] <_joe_> is that expected? [16:50:59] <_joe_> also, I have a meeting at 11 pm, I wanna get off the clock now [16:51:59] Pass [16:52:04] None of the others do [16:52:28] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [16:52:42] https://github.com/wikimedia/puppet/blob/production/modules/mediawiki/files/apache/sites/redirects/redirects.dat#L442-L443 [16:53:08] I don't see anything that redirects from www.wikimedia.org, only t [16:53:09] to [16:53:41] <_joe_> Reedy: ok, I'm still logging off for now [16:53:44] heh [16:54:02] <_joe_> but there's plenty of ops still online I bet [16:54:21] Who might just go "apache? lolno" [16:54:23] * Reedy grins [16:54:45] _joe_: Fancy sticking a CR +1 on it regardless so they know someone else has looked at it please? [16:55:51] (03CR) 10Giuseppe Lavagetto: [C: 031] "seems to do what it's designed for" [puppet] - 10https://gerrit.wikimedia.org/r/369970 (https://phabricator.wikimedia.org/T172417) (owner: 10Reedy) [16:55:58] cheers! [16:57:46] Reedy: yeah www.wikimedia.org vs wikimedia.org as HTTP hostnames is actually a separate thorny issue... [16:57:58] heh [16:58:06] I'll comment on ticket to be explicit that www.
won't work [16:58:07] there's one or two open tickets about it [16:59:08] 10Operations, 10Domains, 10Traffic, 10Wikimedia Resource Center, 10Patch-For-Review: Create resources.wikimedia.org as a redirect - https://phabricator.wikimedia.org/T172417#3506650 (10Reedy) Just a heads up, `www.wikimedia.org/resources` will not work, but `wikimedia.org/resources` will So please put `... [16:59:11] https://phabricator.wikimedia.org/T133178 [16:59:13] ^ is one [16:59:28] I meant on the ticket for this, be explicit about what they should print [17:00:04] yeah I just don't know which is actually more appropriate [17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170807T1700). [17:00:50] I think currently "wikimedia.org/" redirects to "www.wikimedia.org/" (if no URL path), which has our "here's our projects" landing page [17:00:53] jouncebot: o/ [17:01:36] www.wikimedia.org has some global API stuff (as in global to projects/languages)? 
[17:01:49] wikimedia.org has basically-nothing at present I think, except the base URL redirect to www [17:02:16] (03PS4) 10Dzahn: phabricator: open firewall holes only on active_server [puppet] - 10https://gerrit.wikimedia.org/r/370498 (https://phabricator.wikimedia.org/T137928) [17:02:17] but then RB is currently backwards from that (subject of the ticket above) [17:02:37] !log gehel@tin Started deploy [wdqs/wdqs@da33919]: (no justification provided) [17:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:52] but I think the consensus on that ticket is to move RB to www as well [17:03:18] (03CR) 10Paladox: [C: 031] phabricator: open firewall holes only on active_server [puppet] - 10https://gerrit.wikimedia.org/r/370498 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [17:03:50] to me it seems like "wikimedia.org/resources" is more logical than "www.wikimedia.org/resources" for this, but I could see someone involved arguing the opposite maybe. I don't know. [17:05:05] !log gehel@tin Finished deploy [wdqs/wdqs@da33919]: (no justification provided) (duration: 02m 28s) [17:05:05] www.wikimedia.org is more like some kind of meta-wiki (in the real sense of technical-meta to projects/langs, rather than the more abstract content/community-meta of meta.wikimedia.org) [17:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:53] SMalyshev: deployment completed, tests are green [17:06:10] * gehel is keeping a look at error rate, see if we don't throttle too many... [17:06:14] detects "labs" string on www.wikimedia.org [17:06:19] bd808: :) [17:08:19] maybe we should have some sort of official Public URI Namespace Bikeshedding Committee that makes consistent policies and decisions about all related things :) [17:09:36] yea. requesting wiki at bikeshedding.committee.wikimedia.org [17:09:53] gehel: thanks! let's monitor it for a while [17:10:33] gehel: I see logging patch is also merged? 
[17:10:37] !log Restart s7 instance on db1102 to pick up new replication filters - T172693 [17:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:18] 10Operations, 10Discovery, 10Discovery-Analysis, 10Maps, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3506699 (10Gehel) @BBlack I'm probably the one who should be around. I can be available any time from 10am to 11pm CEST (1am to 2pm PT). Just let... [17:12:00] (03CR) 10Jcrespo: "@Dzhan: change the dns of m3-slave to point to the same server than m3-master, that will fix the issue, and I will fix the existing mess w" [puppet] - 10https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [17:15:57] SMalyshev: logstash patch is merged, and it looks like the IP/UA also made it [17:16:34] (03CR) 10Dzahn: [C: 032] "tested with apache-fast-test on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/369970 (https://phabricator.wikimedia.org/T172417) (owner: 10Reedy) [17:18:44] (03PS1) 10Ayounsi: Bumping HP RAID Icinga check timeout from 60 to 90s [puppet] - 10https://gerrit.wikimedia.org/r/370505 (https://phabricator.wikimedia.org/T172708) [17:21:58] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:21:58] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:21:59] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target 
with seed returned the unexpected status 404 (expecting: 200) [17:21:59] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:21:59] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:22:04] (03PS1) 10Dzahn: point m3-slave to same server as m3-master [dns] - 10https://gerrit.wikimedia.org/r/370506 [17:22:08] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:22:18] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:22:20] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:22:20] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:22:30] uh oh [17:22:38] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with 
seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:22:39] seems a config issue? [17:23:50] (03CR) 10Dzahn: "@jcrespo to dbproxy like this? https://gerrit.wikimedia.org/r/#/c/370506/1/templates/wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/369832 (https://phabricator.wikimedia.org/T163938) (owner: 10Dzahn) [17:24:08] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [17:24:08] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [17:24:18] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [17:24:19] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [17:24:28] robh: recommendation_api? no that one just alerts with too much sensitivity (and repetition) on cirrussearch issues [17:24:28] (03CR) 10Jcrespo: [C: 031] "We have done this in the past, when we have done maintenance on the passive host, with no issues." [dns] - 10https://gerrit.wikimedia.org/r/370506 (owner: 10Dzahn) [17:24:38] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [17:24:51] thx for info =] [17:25:30] (03CR) 10Jcrespo: [C: 031] "In fact, we need to do maintenance and upgrade hardware here, so now it is a good time to do both." 
[dns] - 10https://gerrit.wikimedia.org/r/370506 (owner: 10Dzahn) [17:25:47] usually a minute or two later the cirrus ones will alert [17:25:53] i have a partial fix going out in swat...might help [17:26:05] (03PS2) 10Dzahn: point m3-slave to same server as m3-master [dns] - 10https://gerrit.wikimedia.org/r/370506 [17:27:08] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:27:08] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:27:18] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:27:18] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:27:29] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [600.0] [17:27:39] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [17:27:39] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold 
[1000.0] [17:28:27] how is 404 an intermittent issue anyways? I could understand being oversensitive to something like 5xx which might actually be intermittent in a failure, but surely 404-ing a URL is just broken? [17:29:12] (03CR) 10Dzahn: [C: 032] point m3-slave to same server as m3-master [dns] - 10https://gerrit.wikimedia.org/r/370506 (owner: 10Dzahn) [17:29:16] or maybe this is a case of "inappropriate 404", where some service is returning a 404 when a 5xx would be more-appropriate [17:29:28] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [17:30:00] hmm, indeed restbase should be returning a 5xx there [17:30:26] (404 shouldn't be used as "I temporarily can't contact whatever I'm proxying/querying to, so let's call it 'not found'". It should only have the public and consistent (over time) meaning "This URL is not valid and nothing lives here". [17:30:39] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [17:31:19] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0] Gehel known issue, fix coming up soon - T169498 [17:31:28] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [17:32:01] good thing I'm not a LISP programmer, I always lose track of my parens :P [17:33:18] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [17:33:18] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [17:33:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [17:33:18] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [17:33:18] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [17:33:28] RECOVERY - 
recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [17:33:31] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [17:34:18] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [17:35:08] jynus: i can confirm the stats script does not get the 'access denied' anymore now, on phab1001 [17:35:11] thanks [17:35:27] and phab isn't broken :) [17:35:52] I just want to do things well, or else I would never fix that mess [17:35:54] now i just have to fix a totally unrelated issue with that script that come from trusty->jessie [17:36:08] that template is really bad [17:36:38] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] [17:36:46] jynus: yea, that sounds good. glad there was a quick fix :) [17:37:01] let's call it temporary workaround :-) [17:37:06] ok :) [17:40:29] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [17:43:34] (03CR) 10Ayounsi: [C: 032] Bumping HP RAID Icinga check timeout from 60 to 90s [puppet] - 10https://gerrit.wikimedia.org/r/370505 (https://phabricator.wikimedia.org/T172708) (owner: 10Ayounsi) [17:43:41] (03PS2) 10Ayounsi: Bumping HP RAID Icinga check timeout from 60 to 90s [puppet] - 10https://gerrit.wikimedia.org/r/370505 (https://phabricator.wikimedia.org/T172708) [17:44:24] jouncebot: next [17:44:25] In 0 hour(s) and 15 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170807T1800) [17:45:10] would anyone do that window this time? 
:) [17:45:48] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [17:48:31] TabbyCat: regarding this morning's session: SWATs are normally on a best effort basis, especially the 13:00 UTC one as there is less SWAT deployer coverage. [17:48:52] greg-g: I know + it's Wikimania time :) [17:49:03] yup :) [17:52:04] (03CR) 10Dzahn: [C: 032] "no-op on phab1001. closes firewall on phab2001 (except ssh between phab servers) http://puppet-compiler.wmflabs.org/7326/" [puppet] - 10https://gerrit.wikimedia.org/r/370498 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [17:52:20] (03PS5) 10Dzahn: phabricator: open firewall holes only on active_server [puppet] - 10https://gerrit.wikimedia.org/r/370498 (https://phabricator.wikimedia.org/T137928) [17:53:08] i'll be deploying swat if no one else shows up, i have important things that have to go out :P [17:54:58] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [17:56:01] !log stopping slave and repartitioning db1098 [17:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:39] !log phab2001 - re-enabling puppet, but closing firewall for 80/443 [17:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170807T1800). [18:00:04] Amir1, TabbyCat, and Jdlrobson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:13] o/ [18:00:16] meow [18:01:06] and ebernhardson (used irGnick instead or irC) [18:01:20] yea i just fixed that :P i'll ship this today [18:01:40] (03CR) 10EBernhardson: [C: 032] Grant 'autopatrol' to 'editor' in en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370311 (https://phabricator.wikimedia.org/T172561) (owner: 10MarcoAurelio) [18:01:56] :D [18:02:07] (03CR) 10EBernhardson: [C: 032] Translate sitename for nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370313 (https://phabricator.wikimedia.org/T172594) (owner: 10MarcoAurelio) [18:03:07] (03Merged) 10jenkins-bot: Grant 'autopatrol' to 'editor' in en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370311 (https://phabricator.wikimedia.org/T172561) (owner: 10MarcoAurelio) [18:03:20] (03CR) 10jenkins-bot: Grant 'autopatrol' to 'editor' in en.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370311 (https://phabricator.wikimedia.org/T172561) (owner: 10MarcoAurelio) [18:05:13] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Grant autopatrol to editor in en.wikibooks - T172561 (duration: 00m 47s) [18:05:15] Amir1: around? [18:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:26] T172561: Addition to the "autopatrol" right to the user group "reviewers" on the English Wikibooks - https://phabricator.wikimedia.org/T172561 [18:05:33] TabbyCat: auto patrol is out, [18:05:48] ebernhardson: live or on mwdebug? 
[18:05:57] (03PS2) 10EBernhardson: Translate sitename for nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370313 (https://phabricator.wikimedia.org/T172594) (owner: 10MarcoAurelio) [18:05:59] oh, live I see [18:06:05] * TabbyCat checks [18:06:05] (03CR) 10EBernhardson: [C: 032] Translate sitename for nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370313 (https://phabricator.wikimedia.org/T172594) (owner: 10MarcoAurelio) [18:06:31] https://en.wikibooks.org/wiki/Special:ListGroupRights#editor <-- looks good to me [18:07:10] (03CR) 10Dzahn: "there is a redirect to www. when accessing it from external now. it worked when testing from tin on mwdebug1001, but not now..." [puppet] - 10https://gerrit.wikimedia.org/r/369970 (https://phabricator.wikimedia.org/T172417) (owner: 10Reedy) [18:07:29] (03Merged) 10jenkins-bot: Translate sitename for nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370313 (https://phabricator.wikimedia.org/T172594) (owner: 10MarcoAurelio) [18:07:38] (03CR) 10jenkins-bot: Translate sitename for nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370313 (https://phabricator.wikimedia.org/T172594) (owner: 10MarcoAurelio) [18:07:58] TabbyCat: TabbyCat nl.wikinews on mwdebug1001 [18:08:18] checking [18:08:49] (03CR) 10EBernhardson: [C: 032] Enable max token count for phrase rescore on zh lang wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370497 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [18:09:09] ebernhardson: sitename change looks good at mwdebug1001 [18:10:07] syncing [18:10:28] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T172594 - Translate sitename for nl.wikinews (duration: 00m 47s) [18:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:38] T172594: Change $wgSitename for Dutch Wikinews to Wikinieuws - https://phabricator.wikimedia.org/T172594 [18:10:48] rechecking [18:12:08] 
perfect [18:12:21] thanks ebernhardson [18:12:23] np [18:12:32] (here btw) [18:12:41] (just at end of list) [18:12:48] jdlrobson: you snuck in late to an overfull swat :P but it's my fault it's overfull so i guess we can try... :) [18:12:52] ebernhardson, can I add one change to Morning SWAT? [18:12:53] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [18:12:59] i was patch number 8 i believe :) [18:13:07] Urbanecm: swat is already over full, can check once everything else is out if there is still time [18:13:28] jdlrobson: well, somehow there are 10 patches :P [18:13:38] ebernhardson: mine was before yours :P https://wikitech.wikimedia.org/w/index.php?title=Deployments&action=history [18:13:43] ebernhardson, oh, I see. Okay, please ping me after you deploy all patches. I'll use another window if it won't be possible [18:13:50] hehe [18:14:02] abuse :O :P [18:14:08] (joking) [18:14:10] jdlrobson: :P i must not have seen yours below the marker [18:14:16] yeh i put it underneath, that was my fail [18:14:24] hate editing that wiki page... :) [18:14:33] is there a bot for it btw [18:14:37] that would be so cool.. [18:14:44] ++++1 that [18:15:16] jdlrobson: you might find T171940 useful I think [18:15:16] T171940: Create a Gadget to easily add/remove/modify patches for SWAT at wikitech:Deployments - https://phabricator.wikimedia.org/T171940 [18:15:20] jdlrobson: waiting for some things to merge now, then you'll be up [18:16:02] (03PS1) 10Smalyshev: Some requests may have no client IP defined. [puppet] - 10https://gerrit.wikimedia.org/r/370511 (https://phabricator.wikimedia.org/T172713) [18:17:22] ebernhardson: I'm around now [18:17:43] sorry, completely forgot about swat [18:18:04] no worries, i do that too, or i get distracted while other things are deploying in swat and miss when my patch comes up... 
[18:18:45] errrrrrr being an owl at 3am [18:18:52] (phone now tho) [18:20:27] 10Operations, 10Analytics, 10Research: Phase out and replace analytics-store (multisource) - https://phabricator.wikimedia.org/T172410#3507261 (10Halfak) Looks like @jcrespo wants to phase out an analytics/dba maintained resource. I guess I'd expect analytics to lead the process of phasing that out. [18:20:57] (03CR) 10EBernhardson: [C: 032] Enable max token count for phrase rescore on zh lang wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370497 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [18:21:17] 10Operations, 10monitoring: fix librenms LE check for netmon2001 - https://phabricator.wikimedia.org/T172712#3507265 (10Dzahn) a:03Dzahn [18:23:27] !log ebernhardson@tin Synchronized php-1.30.0-wmf.12/extensions/CirrusSearch/: T169498 limit phrase token count, T172464 constant boost ltr queries (duration: 00m 58s) [18:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:41] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498 [18:23:41] T172464: Problems with MLR and small rescore windows - https://phabricator.wikimedia.org/T172464 [18:26:06] (03PS2) 10EBernhardson: Exclude files from Special:ShortPages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369503 (https://phabricator.wikimedia.org/T170687) (owner: 10Jdlrobson) [18:28:12] \o/ [18:28:21] (03CR) 10EBernhardson: [C: 032] Exclude files from Special:ShortPages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369503 (https://phabricator.wikimedia.org/T170687) (owner: 10Jdlrobson) [18:28:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [18:29:54] (03Merged) 10jenkins-bot: Exclude files from Special:ShortPages on commons [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/369503 (https://phabricator.wikimedia.org/T170687) (owner: 10Jdlrobson) [18:30:09] (03CR) 10jenkins-bot: Exclude files from Special:ShortPages on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369503 (https://phabricator.wikimedia.org/T170687) (owner: 10Jdlrobson) [18:30:33] jdlrobson: you're up on mwdebug1001 [18:30:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:30:39] (03CR) 10EBernhardson: [C: 032] Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: 10Revi) [18:30:45] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:30:58] ebernhardson: testing [18:31:00] (please know that I can't test for being mobile) [18:31:05] (03PS4) 10EBernhardson: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: 10Revi) [18:31:15] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:31:29] revi: you can get minerva without being on a mobile device [18:32:12] I mean, X-wikimedia-debug [18:32:17] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:32:36] Whats up with these 5xx's ? [18:33:04] looks like luasandbox in fatalmonitor? [18:33:05] and what I meant... https://usercontent.irccloud-cdn.com/file/kDeWIfnz/IMG_2279.PNG [18:33:07] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 27 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [18:33:18] oh, no not luasandbox. They all have the message %{message}. 
Very useful [18:33:27] ebernhardson: my patch LGTM [18:33:54] jdlrobson: sec to see if these 5xx alerts clear [18:34:09] they look in fatalmonitor to have only been for ~10s [18:36:41] (03CR) 10EBernhardson: [C: 032] Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: 10Revi) [18:37:26] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:38:07] (03Merged) 10jenkins-bot: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: 10Revi) [18:38:13] jdlrobson: syncing out [18:38:20] (03CR) 10jenkins-bot: Enable wgMinervaEnableSiteNotice for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370363 (https://phabricator.wikimedia.org/T172630) (owner: 10Revi) [18:38:26] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:38:41] (03PS2) 10EBernhardson: beta: Add copyright info for Wikidata API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370455 (https://phabricator.wikimedia.org/T112606) (owner: 10Ladsgroup) [18:38:45] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T170687 - Exclude files from Special:ShortPages on commons (duration: 00m 46s) [18:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:56] T170687: [[special:ShortPages]] includes file pages on Commons - https://phabricator.wikimedia.org/T170687 [18:39:43] revi: Amir1 : kowiki config is up on mwdebug1001 [18:39:46] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:40:15] kk... 
(needs a min) [18:40:30] (03CR) 10EBernhardson: [C: 032] beta: Add copyright info for Wikidata API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370455 (https://phabricator.wikimedia.org/T112606) (owner: 10Ladsgroup) [18:40:56] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:41:26] ebernhardson: worksforme [18:41:59] (03Merged) 10jenkins-bot: beta: Add copyright info for Wikidata API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370455 (https://phabricator.wikimedia.org/T112606) (owner: 10Ladsgroup) [18:42:07] (03CR) 10jenkins-bot: beta: Add copyright info for Wikidata API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370455 (https://phabricator.wikimedia.org/T112606) (owner: 10Ladsgroup) [18:42:30] revi: kk, syncing out [18:42:50] Amir1: i'm just going to sync out the other, since its a labs only change [18:42:58] back [18:43:05] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T172630 - Enable wgMinervaEnableSiteNotice for kowiki (duration: 00m 46s) [18:43:06] sorry [18:43:07] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 275 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [18:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:14] T172630: Enable wgMinervaEnableSiteNotice for kowiki - https://phabricator.wikimedia.org/T172630 [18:44:01] works on prod too [18:44:04] it seems [18:44:27] Amir1: your other change should show up on beta within 5 minutes, iiuc [18:44:48] (03PS2) 10EBernhardson: Enable max token count for phrase rescore on zh lang wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370497 (https://phabricator.wikimedia.org/T169498) [18:44:49] revi: so you're up :) Thanks [18:44:53] ebernhardson: Thanks [18:44:54] yeah :P [18:44:57] I couldn't sleep [18:44:58] !log ebernhardson@tin Synchronized wmf-config/Wikibase-labs.php: T112606 - beta only - Add 
copyright info for Wikidata API (duration: 00m 46s) [18:45:07] * ebernhardson wonders why https://gerrit.wikimedia.org/r/#/c/370497/ keeps not merging but without errors ... [18:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:09] T112606: [Bug] The API query for rightsinfo on www.wikidata.org reports CC-SA 3.0 , while its page footer says CC0 as well - https://phabricator.wikimedia.org/T112606 [18:45:57] (03PS2) 10EBernhardson: Update CirrusSearch AB test rescore profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370127 (https://phabricator.wikimedia.org/T171212) [18:48:11] (03CR) 10EBernhardson: [C: 032] Update CirrusSearch AB test rescore profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370127 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [18:49:40] (03Merged) 10jenkins-bot: Update CirrusSearch AB test rescore profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370127 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [18:49:50] (03CR) 10jenkins-bot: Update CirrusSearch AB test rescore profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370127 (https://phabricator.wikimedia.org/T171212) (owner: 10EBernhardson) [18:53:22] (03CR) 10EBernhardson: [C: 032] Enable max token count for phrase rescore on zh lang wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370497 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [18:53:56] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: T171212 - Update CirrusSearch AB test rescore profiles (duration: 00m 46s) [18:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:07] T171212: Interleaved results A/B test: turn on - https://phabricator.wikimedia.org/T171212 [18:55:10] thanks ebernhardson [18:55:11] (03Merged) 10jenkins-bot: Enable max token count for phrase rescore on zh lang wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/370497 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [18:56:21] (03CR) 10jenkins-bot: Enable max token count for phrase rescore on zh lang wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370497 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [18:59:17] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T169498 - Enable max token count for phrase rescore on zh lang wikis (step 1) (duration: 00m 46s) [18:59:28] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3507415 (10Smalyshev) p:05Triage>03Normal [18:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:29] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498 [19:00:06] swat is running a smidgen over, but just one more patch after this syncs [19:00:24] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: T169498 - Enable max token count for phrase rescore on zh lang wikis (step 2) (duration: 00m 46s) [19:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:12] mutante twentyafterfour package heirloom-mailx should fix mail -r command on debian :) [19:05:58] !log ebernhardson@tin Synchronized php-1.30.0-wmf.12/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: T171212 - Turn on CirrusSearch MLR AB test (duration: 00m 46s) [19:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:09] T171212: Interleaved results A/B test: turn on - https://phabricator.wikimedia.org/T171212 [19:06:44] SWAT complete [19:07:16] (03Draft1) 10Paladox: Phabricator: Install package heirloom-mailx for mail command [puppet] - 10https://gerrit.wikimedia.org/r/370518 [19:07:20] (03PS2) 10Paladox: Phabricator: Install package heirloom-mailx for 
mail command [puppet] - 10https://gerrit.wikimedia.org/r/370518 [19:09:00] (03CR) 10Paladox: Phabricator: Install package heirloom-mailx for mail command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/370518 (owner: 10Paladox) [19:10:26] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [19:16:20] (03PS2) 10Herron: Add SPF and DKIM perl package requires to spamassassin class [puppet] - 10https://gerrit.wikimedia.org/r/370487 (https://phabricator.wikimedia.org/T172689) [19:16:26] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [19:17:34] 10Operations, 10DC-Ops: audit spare disk levels for codfw & eqiad utlized storage in servers - https://phabricator.wikimedia.org/T160097#3507570 (10RobH) 05Open>03Resolved dc tracking sheet has 1tb, 2tb, 4tb sata as well as ssd spares now in the 800 and 1.6tb sizes [19:19:39] (03CR) 10Herron: "Sounds good. require_package is much cleaner!" 
[puppet] - 10https://gerrit.wikimedia.org/r/370487 (https://phabricator.wikimedia.org/T172689) (owner: 10Herron) [19:23:40] (03PS1) 10Andrew Bogott: tools-clush-*: move to python-2 [puppet] - 10https://gerrit.wikimedia.org/r/370522 [19:47:47] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Wikipedia-Android-App-Backlog: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#3507727 (10Mholloway) [19:49:24] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog: Create 'pagecompilation' Swift account(s) (beta + prod) for Readers offline article compilations project - https://phabricator.wikimedia.org/T172735#3507730 (10Mholloway) [19:59:19] (03CR) 10Rush: [C: 031] "py3 we love you but no" [puppet] - 10https://gerrit.wikimedia.org/r/370522 (owner: 10Andrew Bogott) [19:59:21] (03CR) 10Andrew Bogott: [C: 032] tools-clush-*: move to python-2 [puppet] - 10https://gerrit.wikimedia.org/r/370522 (owner: 10Andrew Bogott) [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170807T2000). [20:00:26] 10Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3507816 (10JbuattiWMF) 05Resolved>03Open Hey @Dzahn, would it be possible to add AShahrestani to the WMF group? This is again so that one of our legal fellows can work on th... [20:00:32] Nothing for ORES today [20:07:49] (03PS1) 10Gehel: discovery-stats user should be a member of wikidev group [puppet] - 10https://gerrit.wikimedia.org/r/370530 (https://phabricator.wikimedia.org/T172740) [20:10:14] so what's broken that Upload isn't working on Common ? 
[20:10:44] "Our servers are currently under maintenance or experiencing a technical problem." which isn't helping you or me, I suspect. [20:19:13] (03CR) 10Bearloga: [C: 031] discovery-stats user should be a member of wikidev group [puppet] - 10https://gerrit.wikimedia.org/r/370530 (https://phabricator.wikimedia.org/T172740) (owner: 10Gehel) [20:19:36] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: /srv 283092 MB (3% inode=94%) [20:19:40] NotASpy, still doesn't work? [20:20:09] yeah, it uploaded without issue. [20:23:36] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: /srv 279839 MB (3% inode=94%) [20:26:30] 10Operations, 10Patch-For-Review: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955#3507975 (10fgiunchedi) >>! In T156955#3035850, @fgiunchedi wrote: > RAID/disk layer: > * either software or hardware raid > * in any case one block device is exposed (including the single-disk case,... [20:30:33] (03CR) 10Smalyshev: [C: 031] wdqs - remove upstart configuration files [puppet] - 10https://gerrit.wikimedia.org/r/369688 (owner: 10Gehel) [20:31:10] (03PS1) 10Mobrovac: Cassandra: Do not include the main DNS in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610) [20:33:18] 10Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3508045 (10Dzahn) a:05Dzahn>03RobH @JbuattiWMF i'm in the middle of travel now @Robh could you help out by any chance? would be great, thank you! 
[20:40:41] (03PS1) 10RobH: adding AShahrestani to ldap per request [puppet] - 10https://gerrit.wikimedia.org/r/370555 (https://phabricator.wikimedia.org/T140380) [20:40:57] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:42:28] 10Operations, 10AbuseFilter, 10Traffic, 10Zero: user_wpzero doesn't always work - https://phabricator.wikimedia.org/T169907#3412425 (10zhuyifei1999) The fact: uploaders are not always in WP0 ranges, but downloaders are nearly always in WP0 ranges (Z591) [20:42:37] (03CR) 10RobH: [C: 032] adding AShahrestani to ldap per request [puppet] - 10https://gerrit.wikimedia.org/r/370555 (https://phabricator.wikimedia.org/T140380) (owner: 10RobH) [20:43:34] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3508104 (10JbuattiWMF) Thanks @RobH ! [20:44:25] (03PS1) 10RobH: Revert "adding AShahrestani to ldap per request" [puppet] - 10https://gerrit.wikimedia.org/r/370556 [20:45:03] (03CR) 10RobH: [C: 032] Revert "adding AShahrestani to ldap per request" [puppet] - 10https://gerrit.wikimedia.org/r/370556 (owner: 10RobH) [20:45:56] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [20:46:07] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3508106 (10RobH) Yeah ignore those patchsets, they weren't required. 
[20:47:56] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler02/7327/" [puppet] - 10https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610) (owner: 10Mobrovac)
[20:55:36] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[21:00:00] (03PS3) 10Andrew Bogott: openstack: libvirtd.conf from Jessie package [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/369615 (owner: 10Hashar)
[21:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170807T2100). Please do the needful.
[21:00:29] (03PS2) 10Andrew Bogott: openstack: libvirtd.conf from Jessie package [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/369617 (owner: 10Hashar)
[21:00:46] * harej wonders why jouncebot is programmed to speak in such a mystifying tone
[21:01:00] (03CR) 10Andrew Bogott: [C: 032] openstack: libvirtd.conf from Jessie package [1/2] [puppet] - 10https://gerrit.wikimedia.org/r/369615 (owner: 10Hashar)
[21:01:22] (03CR) 10Andrew Bogott: [C: 032] openstack: libvirtd.conf from Jessie package [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/369617 (owner: 10Hashar)
[21:01:27] herron: mystifying? I thought its more like the 'totally not a robot' meme
[21:02:37] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[21:02:47] mystifying indeed :D
[21:05:01] (03PS1) 10RobH: add AShahrestani to admin module for inclusion in wmf group [puppet] - 10https://gerrit.wikimedia.org/r/370579 (https://phabricator.wikimedia.org/T140380)
[21:05:33] (03PS2) 10RobH: add AShahrestani to admin module for inclusion in wmf group [puppet] - 10https://gerrit.wikimedia.org/r/370579 (https://phabricator.wikimedia.org/T140380)
[21:05:43] (03CR) 10RobH: [C: 032] add AShahrestani to admin module for inclusion in wmf group [puppet] - 10https://gerrit.wikimedia.org/r/370579 (https://phabricator.wikimedia.org/T140380) (owner: 10RobH)
[21:16:56] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3508188 (10RobH) 05Open>03Resolved Ok, chatted with @MoritzMuehlenhoff who was able to clarify. We include in the file (even though they have an ldap...
[21:27:55] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3508201 (10Krinkle) a:03Krinkle
[21:28:59] !log T172384: Upgrading Cassandra to 3.11.0-wmf1 in dev environment (build patched to disable in-built heap dumping)
[21:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:12] T172384: OOM exceptions in dev environment - https://phabricator.wikimedia.org/T172384
[21:35:48] Reedy: Did you deploy anything else on Aug 5 around 14:00 UTC besides the two sole entries at https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2017-08-05 ?
[21:36:19] It seems the major regression in backend-save-timing dropped straight back down about 10minutes before that.
https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1&from=1501903577562&to=1502035738902
[21:36:33] from > 1.5s to <0.5s, major drop
[21:37:26] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[21:43:44] herron: it looks like a native speaker of english programmed it to be "cute", and didn't realize the impact on intelligibility
[21:47:17] (sorry, just read https://medium.com/@mollyclare/taming-the-steamroller-how-to-communicate-compassionately-with-non-native-english-speakers-d95d8d1845a0 )
[21:48:26] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[21:59:31] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2684468 (10Pigsonthewing) > Users which cannot move off of the underlying Windows XP operating system can install the latest Firefox...
[22:35:02] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3508328 (10MaxSem) If a corporation is insane enough to still run XP and force their users to run IE, we can only hope that yet anot...
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170807T2300).
[23:04:00] 10Operations, 10Cloud-Services, 10Cloud-VPS: silver has trouble rebooting - https://phabricator.wikimedia.org/T168559#3508388 (10RobH) a:03Andrew @andrew: Since this is slated for decom once the new system is in place, I'm assigning this to you for feedback.
Please let me know when this system can be pull...
[23:04:22] 10Operations, 10Cloud-Services: decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559#3508390 (10RobH)
[23:04:34] 10Operations, 10Cloud-Services, 10hardware-requests: decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559#3368299 (10RobH)
[23:08:05] 10Operations, 10Cloud-Services, 10hardware-requests: decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559#3368299 (10Luke081515) Is there already a task for the replacement of silver?
[23:11:43] 10Operations, 10Cloud-Services, 10Cloud-VPS: logrotate/disk space on silver for nutcracker log - https://phabricator.wikimedia.org/T120683#3508410 (10RobH) a:03chasemp I'm working on Ops Clinic Duty this week, and as part of that work, I've done a search for unowned, high priority tasks in #Operations. Th...
[23:17:30] 10Operations: Something is wrong with installer root disk stuff - https://phabricator.wikimedia.org/T149845#2766226 (10RobH) I'm working on Ops Clinic Duty this week, and as part of that work, I've done a search for unowned, high priority tasks in #Operations. This task came up as a high priority #operations ta...
[23:55:00] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3508516 (10Jayprakash12345)
[23:58:20] 10Operations, 10Domains, 10Traffic, 10Wikimedia Resource Center, 10Patch-For-Review: Create wikimedia.org/resources redirect for Wikimedia Resource Center - https://phabricator.wikimedia.org/T172417#3508519 (10Krinkle)