[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161109T0000). Please do the needful. [00:00:06] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1407 [00:03:13] (03PS1) 10Muehlenhoff: role::mariadb::sanitarium: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320545 [00:05:06] RECOVERY - check_mysql on lutetium is OK: Uptime: 108109 Threads: 5 Questions: 5973163 Slow queries: 496 Opens: 500565 Flush tables: 2 Open tables: 64 Queries per second avg: 55.251 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [00:05:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [00:09:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [00:21:09] (03PS1) 10Muehlenhoff: role::jsbench: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320547 [00:25:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:36:01] (03CR) 10Alex Monk: [C: 031] tcpircbot: improve firewall rule setup [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [00:36:52] (03PS1) 10Gergő Tisza: Revert "Add 'message-format' log channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320548 (https://phabricator.wikimedia.org/T146416) [00:37:10] (03PS2) 10Gergő Tisza: Revert "Add 'message-format' log channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320548 (https://phabricator.wikimedia.org/T146416) [00:38:55] (03PS1) 10Muehlenhoff: role::mediawiki::jobrunner: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/320549 [00:40:59] (03Abandoned) 10Gergő Tisza: Revert "Add 'message-format' log channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320548 (https://phabricator.wikimedia.org/T146416) (owner: 10Gergő Tisza) [00:51:10] (03PS1) 10Gergő Tisza: Add PageViewInfo log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320552 (https://phabricator.wikimedia.org/T129602) [00:51:47] (03PS2) 10Gergő Tisza: Add PageViewInfo log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320552 (https://phabricator.wikimedia.org/T129602) [00:55:26] is jenkins not working? [00:56:40] looks like it's working to me [00:57:39] it's completely ignored my change to the grrrit repo [00:59:05] URL?
[01:00:57] one moment andre__ [01:01:10] andre__ https://gerrit.wikimedia.org/r/320541 [01:05:06] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1242 [01:10:06] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1542 [01:15:06] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1842 [01:19:21] (03PS1) 10Muehlenhoff: ssh_pybal: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320556 [01:20:06] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2142 [01:25:06] RECOVERY - check_mysql on lutetium is OK: Uptime: 112909 Threads: 2 Questions: 6227609 Slow queries: 498 Opens: 509332 Flush tables: 2 Open tables: 64 Queries per second avg: 55.156 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [01:29:03] 06Operations, 10MediaWiki-Sites, 05MW-1.26-release, 13Patch-For-Review, 07SEO: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402#2781732 (10matmarex) I have merged the revert of https://gerrit.wikimedia.org/r/219446 due t... [01:34:45] grrrit-wm-help: help [01:34:57] grrrit-wm: help [01:39:08] (03CR) 10Muehlenhoff: [C: 031] "Looks good. One minor issue: The previous unit had "Restart=on-failure" but the version shipped by the Debian package uses "Restart=always" [puppet] - 10https://gerrit.wikimedia.org/r/320434 (https://phabricator.wikimedia.org/T149992) (owner: 10Filippo Giunchedi) [01:41:59] grrrit-wm: help [01:42:04] My current commands are: grrrit-wm: restart, grrrit-wm: force-restart, and grrrit-wm: nick [02:01:56] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [02:02:06] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:02:36] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [02:04:06] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [02:06:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [02:09:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [02:17:26] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:24:34] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 05m 42s) [02:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:06] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:46:26] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [02:48:20] (03PS3) 10Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) [02:50:58] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.2) (duration: 10m 39s) [02:51:02] !log uploaded libuv 1.9.0 for jessie-wikimedia to carbon [02:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:22] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Nov 9 02:56:22 UTC 2016 (duration 5m 24s) [02:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:06] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [03:05:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [03:09:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [03:26:36] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 701.95 seconds [03:34:43] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2781773 (10MoritzMuehlenhoff) I've just completed a backport of 6.9.1 for jessie-wikimedia. My build no longer links against the shared library copy of c-ares since it requi... [03:36:36] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 245.84 seconds [03:41:02] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2781774 (10Yurik) I have tried running Kartotherian (without rebuilding) under nodejs 6.9.1, and had some issues - Kartotherian wouldn't start due to mapnik native library f... 
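The check_mysql SLOW_SLAVE criticals on lutetium and the MariaDB slave-lag alerts on dbstore1002 above all key off Seconds_Behind_Master from the replica's status output. A minimal sketch of the equivalent manual check, assuming shell access to the replica and MySQL client credentials in a defaults file such as ~/.my.cnf; the grep fields are the same ones the alerts quote:

    # Minimal sketch: inspect replication health by hand on a lagging replica.
    # Assumes credentials come from ~/.my.cnf or similar.
    mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
    # Slave_IO_Running: Yes and Slave_SQL_Running: Yes combined with a large
    # Seconds_Behind_Master is exactly the SLOW_SLAVE case reported above.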
[03:42:44] (03PS3) 10Dzahn: remove gallium.wikimedia.org, keep gallium.mgmt [dns] - 10https://gerrit.wikimedia.org/r/318250 (https://phabricator.wikimedia.org/T95757) [03:43:24] (03CR) 10Dzahn: [C: 032] remove gallium.wikimedia.org, keep gallium.mgmt [dns] - 10https://gerrit.wikimedia.org/r/318250 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [03:44:11] !log gallium.wikimedia.org removed from DNS (T95757) [03:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:18] T95757: Phase out gallium.wikimedia.org - https://phabricator.wikimedia.org/T95757 [03:44:50] !log rolling reboots of mw2097-2134 for new kernel [03:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:48] !log installing java security updates on meitnerium/archiva [03:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:50:41] 06Operations, 10ops-eqiad, 10Continuous-Integration-Infrastructure (phase-out-gallium): decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2781777 (10Dzahn) [03:51:27] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781793 (10GWicke) @anomie, I see- "page 1" being the default is indeed one way where there could be fragmentation. It might indeed make sense to make... [03:52:16] 06Operations, 10ops-eqiad, 10Continuous-Integration-Infrastructure (phase-out-gallium): decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2781795 (10Dzahn) [03:52:28] 06Operations, 10ops-eqiad, 10Continuous-Integration-Infrastructure (phase-out-gallium): decom gallium (data center) - https://phabricator.wikimedia.org/T150316#2781777 (10Dzahn) [03:53:31] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2781802 (10Dzahn) [03:58:20] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2781803 (10Dzahn) gallium shut down and removed from DNS. count: 8 [04:00:00] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2781804 (10Dzahn) [04:05:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [04:10:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [04:12:16] !log rebooting notebook1001/1002 for kernel update [04:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:47] mw2111-2134 left to go... [05:05:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [05:09:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [05:51:36] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [05:58:13] mw2125-2134 left to go [06:05:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [06:09:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [06:19:38] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:34:56] PROBLEM - NTP on mw2128 is CRITICAL: NTP CRITICAL: Offset unknown [06:47:08] and last of the mws are done [06:52:03] thanks apergos! [06:52:18] in a little while I shall be... [06:52:22] trying to get some sleep >_< [06:52:38] and then be back for midday through evening [07:06:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [07:08:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [07:10:26] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [07:11:16] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3057138 keys, up 8 days 22 hours - replication_delay is 0 [07:30:36] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [07:30:56] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [08:05:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [08:09:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [08:13:46] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf] [08:16:15] !log restarted ntp on mw2128 (was stuck in XFAC state) [08:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:10] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) (owner: 10Elukey) [08:23:20] (03PS4) 10Elukey: Raise nagios retry_interval to avoid false alarms for HHVM restarts [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) [08:25:05] !log rolling reboot of logstash1002/1003 for kernel update [08:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:38] (03CR) 10Elukey: [C: 032] Raise nagios retry_interval to avoid false alarms for HHVM restarts [puppet] - 10https://gerrit.wikimedia.org/r/320361 (https://phabricator.wikimedia.org/T147773) (owner: 10Elukey) [08:35:00] RECOVERY - NTP on mw2128 is OK: NTP OK: Offset -0.0003388226032 secs [08:42:50] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [08:43:46] !log rebooting bast1001 for kernel update [08:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:16] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2781961 (10Tgr) The reason the current thumb URL format does not allow larger-than-original (and in most cases same-as-original) sizes is to avoid cach... [09:06:04] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2781963 (10MoritzMuehlenhoff) >>! In T149331#2781773, @MoritzMuehlenhoff wrote: > I've just completed a backport of 6.9.1 for jessie-wikimedia. My build no longer links agai... [09:07:42] (03PS2) 10Jdlrobson: Cleanup unused config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317868 (https://phabricator.wikimedia.org/T148853) [09:11:58] PROBLEM - Host labsdb1009 is DOWN: PING CRITICAL - Packet loss = 100% [09:30:58] 06Operations, 06Analytics-Kanban, 10Traffic: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2781980 (10elukey) We had a discussion on #wikimedia-traffic about this and the Analytics team completely agrees with what Brandon sai... [09:35:14] !log rebooting kafka1018 for kernel + openjdk upgrade [09:35:28] !log rebooting hydrogen for kernel update [09:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:15] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:15] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:15] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:15] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:25] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:25] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:36:25] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:25] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:55] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:37:05] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:05] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:05] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:05] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:05] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:15] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:15] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:37:25] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [09:37:35] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [09:38:15] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [09:38:35] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [09:38:55] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [09:38:55] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [09:38:55] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [09:38:55] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [09:39:05] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [09:39:05] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [09:39:05] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [09:39:05] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [09:39:05] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [09:39:06] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [09:39:10] what the.. [09:39:15] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [09:39:15] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [09:39:25] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [09:40:51] (03PS2) 10Jcrespo: role::mariadb::sanitarium: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320545 (owner: 10Muehlenhoff) [09:41:16] should not be related to the hydrogen reboot, the pdns_recursor service was depooled before the boot and all servers include multiple ntp servers for redundancy (and that wouldn't explain that kind of failure either?) 
[09:41:49] seemed like a network glitch [09:42:02] yep [09:43:03] (03CR) 10Jcrespo: [C: 032] role::mariadb::sanitarium: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/320545 (owner: 10Muehlenhoff) [09:45:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] tcpircbot: improve firewall rule setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [09:48:35] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [09:48:57] (03PS5) 10Jcrespo: Depool db2042 for reimage + upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320392 [09:50:35] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [09:52:55] PROBLEM - Check size of conntrack table on kafka1018 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [09:53:34] ^fixing [09:53:37] thanks! [09:53:40] I was about to do it [09:54:35] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[tshark],Package[tmux] [09:54:41] I restarted Mirror Maker on kafka1012, for some reason it stops when other brokers are down for a bit (so after some produce attempt failures). [09:54:55] RECOVERY - Check size of conntrack table on kafka1018 is OK: OK: nf_conntrack is 51 % full [09:55:46] !log rebooting maerlant for kernel update [09:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:39] (03CR) 10Jcrespo: [C: 032] Depool db2042 for reimage + upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320392 (owner: 10Jcrespo) [09:57:32] (03PS2) 10Thiemo Mättig (WMDE): Add missing $wgPropertySuggesterClassifyingPropertyIds for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 [09:59:05] PROBLEM - Recursive DNS on 91.198.174.122 is CRITICAL: CRITICAL - Plugin timed out while executing system call [09:59:55] RECOVERY - Recursive DNS on 91.198.174.122 is OK: DNS OK: 0.095 seconds response time. www.wikipedia.org returns 91.198.174.192 [09:59:56] (03CR) 10Thiemo Mättig (WMDE): "This causes an actual bug on beta. Go to https://wikidata.beta.wmflabs.org/wiki/Q12 and try to add a new statement for a new property. 
You" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320192 (owner: 10Thiemo Mättig (WMDE)) [10:03:16] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2042 (duration: 00m 48s) [10:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:52] !log upgrading cache_text ulsfo to varnish 4.1.3-1wm3 T150247 [10:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:58] T150247: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247 [10:05:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [10:05:55] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:09:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [10:11:44] !log stopping and reimaging db2042 for upgrade [10:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:28] !log rebooting nescio for kernel update [10:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:35] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [10:23:55] PROBLEM - Host 91.198.174.106 is DOWN: PING CRITICAL - Packet loss = 100% [10:23:55] PROBLEM - Host 2620:0:862:1:91:198:174:106 is DOWN: PING CRITICAL - Packet loss = 100% [10:25:35] RECOVERY - Host 91.198.174.106 is UP: PING OK - Packet loss = 0%, RTA = 83.81 ms [10:25:55] RECOVERY - Host 2620:0:862:1:91:198:174:106 is UP: PING OK - Packet loss = 0%, RTA = 83.81 ms [10:28:04] ah that was nescio [10:33:47] !log rebooting kafka1020 for kernel and openjdk upgrades [10:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:32] akosiaris: good morning! We are missing Trusty packages for apertium, but I am not sure we still run trusty in prod [10:35:47] (puppet fails on deployment-apertium01 which is trusty ) [10:36:47] hashar: yeah, we 've migrated apertium to jessie very recently [10:37:06] I 'll kill the VM and update whatever proxy entries refer to it [10:37:12] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2782042 (10hashar) [10:37:15] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2782040 (10hashar) 05Resolved>03Open deployment-apertium01 (Trusty) has a bunch of missing packages again eg: ``` # tail -n 180 /var/log/puppet.log|eg... [10:37:22] I could not find config files referencing the instance :( [10:38:01] !log change-prop deploying e0040ac [10:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:13] akosiaris: thanks :) [10:38:14] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2782043 (10hashar) Instance IP address is `10.68.16.79` @akosiaris confirmed production moved to Jessie. So I guess we can delete deployment-apertium01 an... 
[10:38:47] they use apertium-beta.wmflabs.org IIRC [10:38:57] which is now a proxy to http://10.68.22.254:2737 [10:39:31] so it points to deployment-apertium02 [10:39:39] I guess we can just delete deployment-apertium01 [10:39:47] kart_: ^ ? is deleting deployment-apertium01 ok ? [10:42:25] moritzm: fixed nf_conntrack-related sysctl on kafka1020 [10:43:22] ok! [10:50:20] (03PS1) 10Muehlenhoff: Restrict access to Hive server [puppet] - 10https://gerrit.wikimedia.org/r/320574 [10:52:31] !log restarting kafka* on kafka1013 for openjdk upgrades [10:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:55] (03CR) 10Mobrovac: [C: 04-1] "Somehow the mariadb module commit change ended up here." [puppet] - 10https://gerrit.wikimedia.org/r/320529 (owner: 10Ppchelko) [11:05:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [11:06:15] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2782075 (10akosiaris) @KartikMistry I am pretty sure we can delete that VM, could you please confirm ? [11:08:46] !log contint1001 apt-get upgrade packages and purging unneeded ones (left over from a puppet manifest that is no longer applied) [11:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:09:44] !log finished upgrading cache_text ulsfo to varnish 4.1.3-1wm3 T150247 [11:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:49] T150247: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247 [11:10:34] !log rebooting mw1162 for kernel update [11:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:47] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2782079 (10KartikMistry) @akosiaris @hashar Yes. We can delete deployment-apertium01. [11:11:14] hashar: ^ [11:11:21] I 'll remove it [11:11:48] \O/ [11:12:22] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2782085 (10akosiaris) [11:12:25] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2782083 (10akosiaris) 05Open>03Resolved Instance terminated. [11:12:44] 06Operations, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2782087 (10Joe) @Tgr to be very clear, I'm ok with both the first two options you cited; what I want, from an ops prepsec... [11:17:40] !log CI gate for MediaWiki fails tests. On it. See https://phabricator.wikimedia.org/T150323 [11:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:45] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [11:24:45] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [11:42:27] Is mw1260 down?
[11:43:25] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:45:50] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2782147 (10mobrovac) >>! In T149331#2781774, @Yurik wrote: > I have tried running Kartotherian (without rebuilding) under nodejs 6.9.1, and had some issues - Kartotherian wo... [11:48:55] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [11:49:55] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2999751 keys, up 9 days 3 hours - replication_delay is 0 [11:50:55] Revent: no, it's up and running [11:53:11] (03PS1) 10Alexandros Kosiaris: icinga: Increase the number of concurrent checks [puppet] - 10https://gerrit.wikimedia.org/r/320582 [11:55:22] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Increase the number of concurrent checks [puppet] - 10https://gerrit.wikimedia.org/r/320582 (owner: 10Alexandros Kosiaris) [11:55:29] (03PS2) 10Alexandros Kosiaris: icinga: Increase the number of concurrent checks [puppet] - 10https://gerrit.wikimedia.org/r/320582 [11:55:47] (03CR) 10Alexandros Kosiaris: [V: 032] icinga: Increase the number of concurrent checks [puppet] - 10https://gerrit.wikimedia.org/r/320582 (owner: 10Alexandros Kosiaris) [11:57:06] moritzm: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Video+scalers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [12:06:36] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [12:06:45] (03PS2) 10Alexandros Kosiaris: icinga: Increase the check intervals [puppet] - 10https://gerrit.wikimedia.org/r/319062 [12:06:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Increase the check intervals [puppet] - 10https://gerrit.wikimedia.org/r/319062 (owner: 10Alexandros Kosiaris) [12:08:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [12:09:52] 06Operations, 10Monitoring, 13Patch-For-Review, 15User-Joe: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2782170 (10akosiaris) https://gerrit.wikimedia.org/r/#/c/320582/ increase the max concurrent checks to 5000, which einsteinium seems to have no problem handling (which w... [12:11:26] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:14:23] !log CI gate for MediaWiki is back. Reverted an oojs-ui version bump that triggered tests failure but was not caught properly by CI. 
T150323 [12:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:30] T150323: OOjs v0.18.0 not in mediawiki/core and VE QUnit tests fails when trying to use it - https://phabricator.wikimedia.org/T150323 [12:28:29] 06Operations, 10ops-codfw, 10DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2782175 (10Marostegui) Papaul has opened a case: https://phabricator.wikimedia.org/T149553#2780459 - I will update the initial description [12:28:36] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2782174 (10MoritzMuehlenhoff) > I'll refrain from uploading this to apt.wikimedia.org until the karthotherian problems are sorted out. In the mean time I've copied the pack... [12:29:37] 06Operations, 10ops-codfw, 10DBA: db2034 crashes meta ticket - https://phabricator.wikimedia.org/T150233#2782176 (10Marostegui) [12:40:32] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:54:09] from https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes it seems a brief spike in 503 for one data point, now gone [12:54:33] akosiaris: deployment from tin is OK, right? [12:56:24] (03PS1) 10Jcrespo: install_server: Install jessie on db1042 [puppet] - 10https://gerrit.wikimedia.org/r/320587 [12:57:55] !log rebooting kafka1022 for kernel + openjdk updates [12:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:27] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:01:43] (03CR) 10Jcrespo: [C: 032] install_server: Install jessie on db1042 [puppet] - 10https://gerrit.wikimedia.org/r/320587 (owner: 10Jcrespo) [13:02:58] !log Update cxserver to 17f9deb [13:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [13:05:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [13:13:17] PROBLEM - Check size of conntrack table on kafka1022 is CRITICAL: CRITICAL: nf_conntrack is 100 % full [13:13:34] fixing --^ [13:14:43] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2782209 (10KartikMistry) I suggest we should test new Node in Beta cluster first. 
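The icinga tuning referenced in the T150061 update above (gerrit 320582, "Increase the number of concurrent checks") maps to icinga 1.x's max_concurrent_checks directive. A hedged way to confirm the running value, assuming the stock Debian config path for icinga:

    # Minimal sketch, assuming the Debian default config location for icinga 1.x.
    grep -E '^max_concurrent_checks' /etc/icinga/icinga.cfg
    # expected after the change discussed above: max_concurrent_checks=5000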
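The recurring "Check size of conntrack table" criticals (kafka1018 earlier, kafka1020, and kafka1022 just above) mean the kernel's connection-tracking table is nearly full, at which point new connections start getting dropped. A minimal sketch of the manual inspection and the kind of runtime sysctl bump being applied; the value here is illustrative, not necessarily what was used in production, and the durable fix is the puppetized ferm sysctl change that shows up below (gerrit 320590):

    # Minimal sketch: compare current conntrack usage against the kernel limit.
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max
    # Raise the ceiling at runtime; 262144 is an illustrative value only.
    sysctl -w net.netfilter.nf_conntrack_max=262144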
[13:15:17] RECOVERY - Check size of conntrack table on kafka1022 is OK: OK: nf_conntrack is 53 % full [13:17:57] PROBLEM - NTP on kafka1022 is CRITICAL: NTP CRITICAL: Offset unknown [13:21:48] (03PS2) 10Ema: cache_text esams: route to codfw [puppet] - 10https://gerrit.wikimedia.org/r/320180 (https://phabricator.wikimedia.org/T131503) [13:21:58] (03CR) 10Ema: [C: 032 V: 032] cache_text esams: route to codfw [puppet] - 10https://gerrit.wikimedia.org/r/320180 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [13:22:55] moritzm: sorry I am a bit ignorant about ntp, but timedatectl status shows "NTP enabled: no" [13:23:02] for kafka1022 [13:23:05] that seems weird [13:23:41] (03CR) 10Gehel: Switch discovery-stats cronjob to a dedicated script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319252 (https://phabricator.wikimedia.org/T149722) (owner: 10MaxSem) [13:24:09] ah same thing on kafka1022 [13:24:13] *1020 [13:24:24] (03PS1) 10Ema: Revert "cache_text esams: route to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/320589 [13:25:58] (03CR) 10Ema: [C: 032] Revert "cache_text esams: route to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/320589 (owner: 10Ema) [13:27:11] let me check [13:27:57] RECOVERY - NTP on kafka1022 is OK: NTP OK: Offset 6.908178329e-05 secs [13:28:31] seems, it always needs a bit to resync after a reboot [13:28:36] seems fine, it always needs a bit to resync after a reboot [13:28:50] ah okok [13:29:03] I restarted ntpd a while ago and checked timedatectl [13:29:32] but I didn't get the "offset unknown" [13:30:41] anyhow, when it is in sync and running it should be ok right? [13:30:50] or should I check more things? [13:31:01] (maybe ntpq -p stats?) [13:33:52] yeah, "ntpq -p" is correct [13:34:32] ahhh now I get the offset [13:34:42] okok great, thanks :) [13:35:29] !log stopping kafka* daemons on kafka1012 to upgrade its fstab with UUID (T147879) [13:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:36] T147879: Audit fstabs on Kafka and Hadoop nodes to use UUIDs instead of /dev paths - https://phabricator.wikimedia.org/T147879 [13:35:37] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [13:38:01] (03PS1) 10Muehlenhoff: Configure connection tracking sysctl settings in ferm [puppet] - 10https://gerrit.wikimedia.org/r/320590 (https://phabricator.wikimedia.org/T136094) [13:41:21] (03CR) 10Faidon Liambotis: [C: 04-1] "Why the "_traditional"? Because of it there is a lot of noise in a commit that would otherwise be simple :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [13:42:11] sigh.. the puppetmaster1001 thing is the result of a commit and then a revert. puppet-merge does if [ -z "$(git diff HEAD..${fetch_head_sha1})" -a -z "${submodule_changes}" ]; then and hence sees no real diff and just says "No changes to merge."
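The "Unmerged changes on repository puppet" alert above is a false positive of sorts: the icinga check compares refs (HEAD vs origin/production), while the puppet-merge guard quoted just above compares content. After a commit plus its revert the tree is byte-identical, so the guard prints "No changes to merge." even though the remote ref is two commits ahead. A hedged sketch of that interaction, reusing the script's own variable names from the quote (submodule_changes is set elsewhere in the real script):

    # Hedged sketch of the mismatch described above; fetch_head_sha1 and
    # submodule_changes mirror the variables in the quoted puppet-merge snippet.
    git fetch origin
    fetch_head_sha1=$(git rev-parse FETCH_HEAD)
    git rev-list --count HEAD..origin/production   # 2 after a commit + its revert
    if [ -z "$(git diff HEAD..${fetch_head_sha1})" -a -z "${submodule_changes}" ]; then
        echo "No changes to merge."                # content-wise a no-op
    fi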
[13:43:05] akosiaris: ugh, that's my fault [13:44:25] 06Operations, 10Icinga, 10Monitoring: Monitor all mgmt hosts - https://phabricator.wikimedia.org/T85143#2782267 (10akosiaris) [13:44:27] 06Operations, 10Monitoring: icinga "max concurrent checks" limits reached - https://phabricator.wikimedia.org/T1242#2782268 (10akosiaris) [13:44:29] 06Operations, 10Icinga: improve icinga performance / solve general load issues on neon - https://phabricator.wikimedia.org/T85222#2782265 (10akosiaris) 05stalled>03Resolved This has more or less been resolved for now, mostly due to the migration of neon to einsteinium and the deprecation of check_sslXNN. [13:45:00] akosiaris: the original change was actually OK and I shouldn't have reverted it. Do you think it'd be enough to just reintroduce it to fix the issue? [13:45:55] (03PS3) 10Dzahn: remove neon from puppet/netboot/dhcp/network [puppet] - 10https://gerrit.wikimedia.org/r/318437 (https://phabricator.wikimedia.org/T125023) [13:46:09] ema: yes [13:46:34] (03CR) 10Alexandros Kosiaris: [C: 031] remove neon from puppet/netboot/dhcp/network [puppet] - 10https://gerrit.wikimedia.org/r/318437 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [13:46:42] !log rebooting kafka1012 for kernel and openjdk updates [13:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:52] (03CR) 10Alexandros Kosiaris: [C: 031] remove neon.wikimedia.org, keep neon.mgmt [dns] - 10https://gerrit.wikimedia.org/r/318444 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [13:47:01] (03PS1) 10Ema: Revert "Revert "cache_text esams: route to codfw"" [puppet] - 10https://gerrit.wikimedia.org/r/320591 [13:48:29] 06Operations, 10Monitoring, 13Patch-For-Review: Fix up icinga puppetization - https://phabricator.wikimedia.org/T110893#2782295 (10akosiaris) 05Open>03Resolved a:03akosiaris I am going to resolve this. Most of these issues have been fixed (along with others) either in https://gerrit.wikimedia.org/r/#/... [13:48:38] (03CR) 10Ema: [C: 032] Revert "Revert "cache_text esams: route to codfw"" [puppet] - 10https://gerrit.wikimedia.org/r/320591 (owner: 10Ema) [13:48:43] 06Operations, 10ops-eqiad, 10DBA, 06Labs, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2782298 (10mark) [13:49:34] 06Operations, 10ops-eqiad, 10hardware-requests, 10netops, 13Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2782302 (10mark) [13:49:37] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [13:50:25] akosiaris: fixed now, thanks [13:51:32] :-) [13:51:40] hashar: ready for eu swat? :) [13:51:57] looks like there are three relatively simple config changes [13:52:19] jouncebot: next [13:52:19] In 0 hour(s) and 7 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161109T1400) [13:53:38] zeljkof: yup, straightforward. Wanna handle it?
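Back on the earlier NTP confusion around kafka1022: timedatectl's "NTP enabled: no" very likely reflects only that systemd-timesyncd is off, which is expected on hosts running a standalone ntpd, so ntpq was indeed the right tool. A minimal sketch, assuming a host with classic ntpd:

    # Minimal sketch for verifying a standalone ntpd after a reboot; an asterisk
    # in the first column of ntpq -p marks the peer currently selected for sync,
    # and offset/jitter are reported in milliseconds.
    ntpq -p
    # timedatectl only reports on the systemd-managed time sync, hence
    # "NTP enabled: no" here even while ntpd itself is healthy:
    timedatectl status | grep -i ntp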
[13:54:03] hashar: sure, looks easy enough, even I could do it :D [13:58:24] zeljkof: please do :) [13:58:38] hashar: deal [13:58:38] !log stopping kafka* daemons on kafka1014 to upgrade its fstab with UUID (T147879) [13:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:45] T147879: Audit fstabs on Kafka and Hadoop nodes to use UUIDs instead of /dev paths - https://phabricator.wikimedia.org/T147879 [13:58:46] hashar: uh oh [13:58:49] "ECDSA host key for mw1099.eqiad.wmnet has changed and you have requested strict checking." [13:59:46] does anybody know if the key for mw1099 has changed? [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161109T1400). [14:00:04] dcausse and jdlrobson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:24] o/ [14:00:31] I can SWAT today! [14:00:53] \o [14:00:53] jdlrobson: around for swat? [14:01:16] ok, everything is ready, but I am getting these error messages [14:01:17] coffee break, will be back in ~ 10 mins [14:01:27] "ECDSA host key for deployment.eqiad.wmnet has changed and you have requested strict checking." [14:01:37] can anybody confirm that the keys have really changed? [14:02:01] 06Operations, 10ops-eqiad, 10hardware-requests, 10netops, 13Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2764445 (10mark) labsdb1008 is in rack C3 right now. We can simply move it to vlan private1-c... [14:02:21] grab the key from tin [14:02:31] $ grep mw1099 /etc/ssh/ssh_known_hosts [14:02:37] and copy paste to your ssh_known_hosts :D [14:03:14] hashar: but how do I connect when I am getting these error messages? [14:03:33] maybe that is the bastion [14:03:55] just accept and proceed ? :D [14:04:01] coffee time here [14:04:25] https://phabricator.wikimedia.org/P4397 [14:04:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [14:04:50] hashar: ok, if you think it is fine to accept new keys, doing that [14:05:30] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:08:04] (03PS2) 10Ema: cache_text: upgrade esams to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/320181 (https://phabricator.wikimedia.org/T131503) [14:08:11] (03CR) 10Ema: [C: 032 V: 032] cache_text: upgrade esams to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/320181 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [14:08:59] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [14:09:41] this is me sorry --^ [14:09:53] (03CR) 10Ottomata: [C: 04-1] "Let's use the analytics network here. Spark masters (which can run on any Hadoop node) may talk with the hive server."
[puppet] - 10https://gerrit.wikimedia.org/r/320574 (owner: 10Muehlenhoff) [14:10:03] dcausse and jdlrobson: apologies for the delay, key problems, fixed, starting swat [14:10:11] sure, np [14:10:44] dcausse: can we test your patches at mw1099, or should I deploy to cluster immediately? [14:11:00] (03PS4) 10Zfilipin: [cirrus] Increase the number of shards to 15 for commonswiki_file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316964 (https://phabricator.wikimedia.org/T148736) (owner: 10DCausse) [14:11:15] !log upgrading cp3030 (text-esams) to varnish 4 -- T131503 [14:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:23] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [14:11:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [14:11:39] zeljkof: yes I think so [14:12:02] ok, in that case deploying to mw1099 first, will let you know when the first patch is there [14:12:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316964 (https://phabricator.wikimedia.org/T148736) (owner: 10DCausse) [14:13:04] (03PS2) 10Muehlenhoff: Restrict access to Hive server [puppet] - 10https://gerrit.wikimedia.org/r/320574 [14:13:15] dcausse: for https://gerrit.wikimedia.org/r/#/c/316964 is it important in which order the files are deployed? [14:13:19] (03Merged) 10jenkins-bot: [cirrus] Increase the number of shards to 15 for commonswiki_file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316964 (https://phabricator.wikimedia.org/T148736) (owner: 10DCausse) [14:13:20] !log rebooting kafka1014.eqiad.wmnet for kernel and openjdk upgrades [14:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:39] (03CR) 10Ottomata: [C: 031] Restrict access to Hive server [puppet] - 10https://gerrit.wikimedia.org/r/320574 (owner: 10Muehlenhoff) [14:13:49] (03CR) 10Ottomata: "Commit message needs fixed though! :)" [puppet] - 10https://gerrit.wikimedia.org/r/320574 (owner: 10Muehlenhoff) [14:13:54] dcausse: tests/cirrusTest.php then wmf-config/InitialiseSettings.php? [14:14:08] zeljkof: no, the second file is just a unit test that is not read by prod [14:14:18] I mean whatever order should work [14:14:50] dcausse: great, 316964 is merged, pushing to mw1099 in a minute [14:16:00] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1014 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [14:17:39] dcausse: 316964 is at mw1099, please test and let me know if I can push to cluster [14:17:48] zeljkof: testing [14:18:27] moritzm: I see : PHP Fatal error: Class 'Memcached' not found in /srv/mediawiki/php-1.29.0-wmf.1/includes/libs/objectcache/MemcachedPeclBagOStuff.php [14:18:32] on mw1099 [14:18:55] running mwscript [14:19:02] eeek [14:19:10] dcausse: which machine ? [14:19:14] mw1099 [14:19:24] that one got reimaged hasn't it ? 
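That reimage is also the likely explanation for the ECDSA warnings during SWAT setup above: reinstalling mw1099 regenerated its host keys, so strict host-key checking correctly flags the mismatch. A minimal sketch of the refresh hashar described, assuming OpenSSH's default client paths:

    # Minimal sketch, assuming the default ~/.ssh/known_hosts on the client side.
    ssh-keygen -R mw1099.eqiad.wmnet                     # drop the stale entry
    # then pull the authoritative key from the centrally managed file on tin:
    ssh tin.eqiad.wmnet grep mw1099 /etc/ssh/ssh_known_hosts >> ~/.ssh/known_hosts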
[14:19:38] jessie app servers no longer include Zend extensions such as php-memcached [14:19:59] and mwscript is hardcoded to use php5 (Zend) [14:20:01] BUT [14:20:08] the work machines such as terbium do have the zend packages [14:20:13] I think so, it happened on wasat.codfw.wmnet and I remember that moritzm installed a package to fix it [14:20:20] so maybe you can scap pull on terbium and run mwscript there [14:20:24] or wasat yeah [14:20:31] that one probably has fewer scripts running on it [14:20:53] one day we will make mwscript use php, and hence hhvm. I can't remember offhand what the issue was but there is a task for it [14:21:06] let me have a look [14:21:07] well that was just to test the config var with eval, I think I can see them from the website [14:21:30] (03PS4) 10Zfilipin: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318356 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [14:22:37] dcausse: it's missing php5-memcached [14:23:11] which is "normal" :D [14:23:59] it's only installed via mediawiki::packages::php5 on the deployment servers and script runners [14:24:10] (and contint/openstack) [14:24:51] does it mean that we can't run mwscript on mw1099? [14:25:18] dcausse, hashar, moritzm: what do we do with patch 316964? revert? deploy to cluster? [14:25:28] zeljkof: all good, testing with a custom api call and values are properly updated [14:25:50] dcausse: should I deploy to cluster? [14:25:56] zeljkof: yes [14:26:04] ok, deploying, will let you know when done [14:28:19] !log zfilipin@tin Synchronized tests/cirrusTest.php: SWAT: [[gerrit:316964|[cirrus] Increase the number of shards to 15 for commonswiki_file (T148736)]] (duration: 00m 58s) [14:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:26] T148736: Write errors on elasticsearch eqiad - https://phabricator.wikimedia.org/T148736 [14:29:12] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782384 (10Gilles) >>! In T66214#2781560, @Anomie wrote: > Say we didn't already have a parameter to select the page of a PDF. Then we add the paramete... [14:29:33] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:316964|[cirrus] Increase the number of shards to 15 for commonswiki_file (T148736)]] (duration: 00m 49s) [14:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:46] zeljkof: thanks! [14:30:00] dcausse: 316964 deployed, please check [14:30:07] zeljkof: all good [14:30:30] dcausse: working on 318356 now, first step rebase+merge [14:31:09] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318356 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [14:31:24] 06Operations, 10Electron-PDFs, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2782395 (10mobrovac) [14:31:36] dcausse: can 318356 be tested on mw1099?
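To make the earlier mwscript failure concrete: the fatal comes from the Zend PHP5 CLI lacking the memcached extension on reimaged jessie app servers like mw1099, while deployment servers and script runners (terbium, wasat) still carry it via mediawiki::packages::php5. A quick hedged check for whether a given host can run mwscript at all:

    # Minimal sketch: confirm whether the Zend CLI on this host can load the
    # memcached extension that MemcachedPeclBagOStuff needs.
    php5 -m | grep -i memcached      # empty on the reimaged app servers
    dpkg -l php5-memcached           # installed only on deploy/maintenance hosts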
[14:31:40] zeljkof: yes [14:32:04] (03Merged) 10jenkins-bot: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318356 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [14:32:22] !log upgrading cp3031 (text-esams) to varnish 4 -- T131503 [14:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:28] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [14:33:29] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:34:22] dcausse: 318356 is on mw1099, please test [14:34:34] testing [14:34:47] jdlrobson: ready for swat? you are next :) [14:34:51] yep [14:34:53] should be a quick one [14:34:57] (03PS3) 10Zfilipin: Cleanup unused config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317868 (https://phabricator.wikimedia.org/T148853) (owner: 10Jdlrobson) [14:35:06] jdlrobson: can it be tested at mw1099? [14:35:09] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3031 is CRITICAL: connect to address 10.20.0.166 and port 3122: Connection refused [14:35:16] not really. It should be a noop [14:35:32] so basically test is does anything throw exceptions when you remove them [14:35:51] there shouldn't be any references to these variables so that shouldnt happen :) [14:36:05] jdlrobson: ok, in that case deploying to cluster immediately when merged, will ping you when ready, as soon as I am finished with 318356 [14:36:09] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3031 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.168 second response time [14:36:11] sounds good! [14:36:17] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782399 (10Gilles) >>! In T66214#2781793, @GWicke wrote: > @anomie, I see- "page 1" being the default is indeed one way where there could be fragmentat... [14:38:26] zeljkof: looks good, fyi we may see a short spike of poolcounter errors [14:39:23] dcausse: ok, deploying [14:41:59] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:318356|[cirrus] Activate BM25 on top 10 wikis: Step 3 (T147508)]] (duration: 00m 48s) [14:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:04] T147508: BM25: initial limited release into production - https://phabricator.wikimedia.org/T147508 [14:42:12] dcausse: deployed, please check [14:42:18] zeljkof: looking [14:42:34] jdlrobson: working on 318356, will ping you when deployed [14:43:00] jdlrobson: sorry, working on 317868, will ping you when deployed :) [14:43:16] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317868 (https://phabricator.wikimedia.org/T148853) (owner: 10Jdlrobson) [14:43:59] (03Merged) 10jenkins-bot: Cleanup unused config variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317868 (https://phabricator.wikimedia.org/T148853) (owner: 10Jdlrobson) [14:44:45] zeljkof: elastic@eqiad is now getting 100% of the traffic and no errors so far. I think it's ok, thanks! [14:44:54] dcausse: great! [14:46:10] jdlrobson: is the order in which files are deployed important? [14:46:58] zeljkof: the order? [14:47:25] there are two files, is it important to deploy one before the other? 
[14:47:32] wmf-config/InitialiseSettings-labs.php is probably the safest to start with [14:47:37] since it should only impact labs [14:47:47] jdlrobson: ok, will start there [14:48:05] deploying it [14:48:43] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:317868|Cleanup unused config variables (T148853)]] (duration: 00m 47s) [14:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:50] T148853: Cleanup removed mediawiki config variables for new language switcher and close epic task - https://phabricator.wikimedia.org/T148853 [14:49:13] jdlrobson: InitialiseSettings-labs.php is deployed, can you check if everything looks fine? [14:49:36] should I deploy the other file immediately, or wait for you to check if the first one broke anything? [14:51:21] zeljkof: LGTM just checking logs [14:51:35] jdlrobson: can I deploy the other file too? [14:51:41] zeljkof: yup [14:52:30] deploying... [14:53:11] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:317868|Cleanup unused config variables (T148853)]] (duration: 00m 48s) [14:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:34] jdlrobson: everything deployed, please check [14:54:07] LGTM thanks zeljkof [14:55:46] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2782448 (10Gehel) Issue opened upstream to see if we can get some external help: https://github.com/elastic/elast... [14:55:58] jdlrobson: great! [14:56:01] in that case... [14:56:10] !log EU SWAT finished [14:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:18] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782449 (10Anomie) >>! In T66214#2781793, @GWicke wrote: > @anomie, I see- "page 1" being the default is indeed one way where there could be fragmentat... [14:58:24] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782454 (10GWicke) > Currently the Varnish task seem to only focus on parameter reordering, when this is already an early sign that as the URL scheme e... 
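The ordering zeljkof used above is the conservative SWAT pattern: sync the labs-only file first (it is read only on beta, so a mistake cannot hurt production), verify, then sync the production file. A hedged sketch of the two steps, assuming scap's sync-file subcommand is what produces the "Synchronized wmf-config/..." log lines; the log messages are copied from the entries above:

    # Hedged sketch of the per-file SWAT sequence above, run from the deploy host.
    scap sync-file wmf-config/InitialiseSettings-labs.php 'SWAT: [[gerrit:317868|Cleanup unused config variables (T148853)]]'
    # check logs and beta behaviour, then:
    scap sync-file wmf-config/InitialiseSettings.php 'SWAT: [[gerrit:317868|Cleanup unused config variables (T148853)]]'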
[14:58:51] !log upgrading cp3032 (text-esams) to varnish 4 -- T131503 [14:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:57] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [14:59:59] !log clear zero sized log files on logstash* (leftover from disk space issues) [15:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:07:20] !log restarting Jenkins (java update) [15:07:23] moritzm: ^^:) [15:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [15:11:15] !log upgrading cp3033 (text-esams) to varnish 4 -- T131503 [15:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:23] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [15:12:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:13:59] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3000865 keys, up 9 days 6 hours - replication_delay is 0 [15:16:15] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2782495 (10Joe) @Andrew will do. Just to be sure, what's the way you extract your parameters for roles in the horizon UI? Do you have a pre-baked pars... [15:18:09] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:20:04] (03PS6) 10Ottomata: eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) [15:21:13] (03CR) 10jenkins-bot: [V: 04-1] eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [15:21:15] (03PS7) 10Ottomata: eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) [15:22:56] (03PS8) 10Ottomata: eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) [15:23:16] !log upgrading cp3040 (text-esams) to varnish 4 -- T131503 [15:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:22] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [15:23:36] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782563 (10Anomie) >>! In T66214#2782384, @Gilles wrote: > In regards to the client cache fragmentation, I think most clients are consistent with thems... 
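The "clear zero sized log files" entry above is the kind of one-off cleanup that fits in a few lines; a sketch, under the assumption that the files live somewhere like /var/log/logstash (the log only names "logstash*" hosts, not a path):

    # Delete empty (zero-byte) files left behind by an earlier disk-space
    # incident. The root path is an assumed example.
    import os

    root = "/var/log/logstash"
    for dirpath, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                print("removing", path)
                os.remove(path)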
[15:24:51] (03CR) 10jenkins-bot: [V: 04-1] eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [15:25:22] (03PS1) 10Jcrespo: Depool db2048 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320608 (https://phabricator.wikimedia.org/T150334) [15:26:08] (03PS2) 10Jcrespo: Depool db2048 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320608 (https://phabricator.wikimedia.org/T150334) [15:26:37] (03PS9) 10Ottomata: eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) [15:27:29] (03CR) 10Jcrespo: [C: 032] Depool db2048 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320608 (https://phabricator.wikimedia.org/T150334) (owner: 10Jcrespo) [15:28:08] (03Merged) 10jenkins-bot: Depool db2048 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320608 (https://phabricator.wikimedia.org/T150334) (owner: 10Jcrespo) [15:28:26] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782611 (10Anomie) >>! In T66214#2782454, @GWicke wrote: > The bigger issue I see with query strings is that (contrary to what I thought earlier) all c... [15:30:00] (03CR) 10Ottomata: [C: 032] eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [15:30:07] (03PS10) 10Ottomata: eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) [15:30:10] (03CR) 10Ottomata: [V: 032] eventstreams puppetization [puppet] - 10https://gerrit.wikimedia.org/r/317981 (https://phabricator.wikimedia.org/T148779) (owner: 10Ottomata) [15:30:41] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2048 (duration: 00m 50s) [15:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:12] !log upgrading cp3041 (text-esams) to varnish 4 -- T131503 [15:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:18] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [15:34:29] PROBLEM - Varnish HTCP daemon on cp2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (vhtcpd), args vhtcpd [15:36:02] I guess this is akosiaris (T133387) ^ [15:36:03] T133387: HTCP purges flood across CODFW - https://phabricator.wikimedia.org/T133387 [15:36:54] yup [15:37:33] I have no idea how to reliably reproduce this freaking bug in a different environment than asw siwtches [15:37:55] so I use the scientific method. Mess up with things and observe what happens ;-) [15:38:07] :) [15:39:31] ok, I think I am gonna rule out vhtcpd... at least that something
[15:40:44] there is low space on silver, if someone wants to check logs or something [15:41:29] RECOVERY - Varnish HTCP daemon on cp2007 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd [15:47:09] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:47:21] !log T133395: Converting the next 25 RESTBase keyspaces to TWCS [15:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:28] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [15:58:22] !log upgrading cp3042 (text-esams) to varnish 4 -- T131503 [15:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:28] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [16:00:56] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2782663 (10hoo) @BBlack Given {T109458} is not implemented, there is no caching for these pages right now. If you consider this a r... [16:04:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [16:06:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [16:13:11] (03PS4) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [16:17:40] 06Operations, 10EventBus, 10procurement: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2782678 (10RobH) [16:17:55] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2782696 (10RobH) [16:17:57] 06Operations, 10EventBus, 10procurement: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2782678 (10RobH) [16:18:05] !log upgrading cp3043 (text-esams) to varnish 4 -- T131503 [16:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:11] T131503: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503 [16:21:09] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3043 is CRITICAL: connect to address 10.20.0.178 and port 3123: Connection refused [16:21:12] 06Operations, 10ops-codfw: rack spare pool servers and update tracking sheet - https://phabricator.wikimedia.org/T150341#2782701 (10RobH) [16:21:44] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2782716 (10RobH) [16:22:09] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3043 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.167 second response time [16:22:52] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2782718 (10Andrew) @Joe, I use the puppetmaster API: https://labcontrol1001.wikimedia.org:8140/puppet/resource_types/role The /role part is arbitrary,...
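The TWCS conversion logged at 15:47 above amounts to an ALTER TABLE per table at the CQL level. A sketch using the third-party cassandra-driver package; the keyspace, table, and window settings are illustrative, not taken from the real RESTBase schema, and the short strategy class name assumes upstream Cassandra >= 3.0.8 (older clusters shipped TWCS as an external jar with a longer class path):

    # Switch one table to TimeWindowCompactionStrategy, the change being
    # rolled out to RESTBase keyspaces above. All names are placeholders.
    from cassandra.cluster import Cluster  # third-party: pip install cassandra-driver

    session = Cluster(["localhost"]).connect()
    session.execute(
        "ALTER TABLE example_keyspace.data WITH compaction = {"
        " 'class': 'TimeWindowCompactionStrategy',"
        " 'compaction_window_unit': 'DAYS',"
        " 'compaction_window_size': '1' }"
    )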
[16:23:21] (03PS5) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [16:24:15] (03PS6) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [16:29:18] (03PS7) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [16:32:18] (03PS9) 10Mobrovac: PDF Render Service: Role and module [puppet] - 10https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) [16:33:46] (03CR) 10Reedy: [C: 031] "Ok, this is ready to go now, it seems..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) (owner: 10Reedy) [16:34:25] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782727 (10Gilles) >>! In T66214#2782449, @Anomie wrote: > I think redirects are probably //essential//. You say that the performance impact of redirec... [16:37:05] 06Operations, 10ops-eqiad, 10hardware-requests: Decommission db1019 - https://phabricator.wikimedia.org/T147309#2782729 (10Southparkfan) [16:37:40] (03PS8) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [16:38:13] (03CR) 10jenkins-bot: [V: 04-1] Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) (owner: 10Reedy) [16:39:29] (03PS9) 10Reedy: Rename -labs to -beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320425 (https://phabricator.wikimedia.org/T150268) [16:41:16] (03PS4) 10Dzahn: remove neon from puppet/netboot/dhcp/network [puppet] - 10https://gerrit.wikimedia.org/r/318437 (https://phabricator.wikimedia.org/T125023) [16:42:19] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:44:36] (03PS2) 10Andrew Bogott: wikistatus: fewer login tries with a longer delay between [puppet] - 10https://gerrit.wikimedia.org/r/320527 [16:49:04] (03CR) 10Andrew Bogott: [C: 032] wikistatus: fewer login tries with a longer delay between [puppet] - 10https://gerrit.wikimedia.org/r/320527 (owner: 10Andrew Bogott) [16:53:39] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:55:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [16:55:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:00:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:00:10] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:00:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:00:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:05:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:05:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:05:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:05:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:06:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [17:08:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [17:10:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:10:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:10:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:10:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:10:10] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:10:19] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:15:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:15:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:15:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:15:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:15:10] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:15:19] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:18:09] PROBLEM - Host mc1035 is DOWN: PING CRITICAL - Packet loss = 100% [17:18:09] PROBLEM - Host mc1036 is DOWN: PING CRITICAL - Packet loss = 100% [17:18:24] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782833 (10GWicke) >>>! In T66214#2782454, @GWicke wrote: >> This is because any thumbnail can have any number of author-supplied parameters already >... 
[17:19:09] PROBLEM - Host mc1034 is DOWN: PING CRITICAL - Packet loss = 100% [17:19:19] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:20:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:20:10] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:20:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:20:10] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:20:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:20:19] RECOVERY - check_raid on payments2002 is OK: OK: HPSA [P222/slot1: OK, log_1: 465.7GB,RAID1 OK] [17:20:19] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1249.eqiad.wmnet because of too many down!: ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! [17:20:39] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1236.eqiad.wmnet because of too many down!: ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! [17:21:30] RECOVERY - Host mc1036 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:21:33] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:21:33] RECOVERY - Host mc1034 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:21:39] RECOVERY - Host mc1035 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [17:21:39] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:23:19] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [17:23:39] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [17:24:01] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782850 (10Gilles) Provided by editors in the context of wikitext, yes. I thought you meant the media's author was somehow able to set default paramete... [17:24:19] PROBLEM - puppet last run on mc1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:24:19] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:25:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:25:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:25:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:25:10] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:25:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:25:19] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:28:27] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782860 (10GWicke) @gilles: Having such author-supplied parameters in a thumb means that changing the size is not just a matter of adding a *single* qu... 
[17:28:37] 06Operations, 10Traffic, 10netops: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#2782861 (10BBlack) [17:29:45] (03PS2) 10Filippo Giunchedi: prometheus: use systemd override for node-exporter [puppet] - 10https://gerrit.wikimedia.org/r/320434 (https://phabricator.wikimedia.org/T149992) [17:30:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:30:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:30:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:30:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:30:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:30:10] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:33:07] (03CR) 10Filippo Giunchedi: [C: 032] "Indeed Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/320434 (https://phabricator.wikimedia.org/T149992) (owner: 10Filippo Giunchedi) [17:34:18] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782869 (10Gilles) In both cases you need to parse a string and insert the value in the right place. I hope we're not going to design an API a certain... [17:34:30] !log rebooting db2048 for kernel upgrade [17:34:31] 06Operations, 10netops: HTCP purges flood across CODFW - https://phabricator.wikimedia.org/T133387#2782871 (10akosiaris) Some more information about this. After quite a bit of debugging I 've gathered the following facts * The issue is present across all CODFW rows as well as asw2-d-eqiad * The issue only man... [17:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:35:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:35:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:35:10] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:35:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:35:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:35:24] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring: prometheus-node-exporter package should use a systemd override - https://phabricator.wikimedia.org/T149992#2782873 (10fgiunchedi) 05Open>03Resolved Fixed in https://gerrit.wikimedia.org/r/320434 by shipping an override file as suggested [17:36:05] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782875 (10Gilles) You can look at any two formats and make up a use case that would be easier in one than the other. I don't think that should be the... [17:36:09] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:36:09] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:37:09] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:40:09] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:40:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:40:10] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:40:10] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:40:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:40:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:41:37] !log the puppet failures on the frack hosts are known and have been reported to jeff [17:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:04] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782890 (10GWicke) The point is that a custom ordered-querystring serialization function will be needed, while in the path-based API regular path manip... [17:45:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:45:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:45:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:45:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:45:10] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:45:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:48:19] RECOVERY - puppet last run on mc1036 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:48:53] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 12 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2782913 (10Yurik) @KartikMistry I agree, but we don't have Kartotherian/Tilerator running in beta cluster (we should). @mobrovac, seems like my git repo got corrupted. Still... [17:49:19] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:49:50] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782915 (10Gilles) The pros and cons of the current URI scheme should be included in the same matrix. Because you'll find that almost all of the upside... 
[17:50:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:50:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:50:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:50:09] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [17:50:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:50:10] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [17:51:21] (03PS1) 10Filippo Giunchedi: role: use single ferm rule for prometheus node/varnish exporter [puppet] - 10https://gerrit.wikimedia.org/r/320622 [17:52:59] (03CR) 10jenkins-bot: [V: 04-1] role: use single ferm rule for prometheus node/varnish exporter [puppet] - 10https://gerrit.wikimedia.org/r/320622 (owner: 10Filippo Giunchedi) [17:55:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [17:55:09] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:55:09] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:55:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [17:55:09] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [17:55:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [17:55:56] thanks rObh for the heads up [17:57:29] yeah jeff repolied to my text and said he would be back to fix [17:57:38] seems better to let it spam every 5 minutes than silence it [17:57:48] though i guess i could silence it for the next 30 minutes safely enough [17:57:49] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:58:14] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.29 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/320374 (https://phabricator.wikimedia.org/T150208) (owner: 10Gilles) [17:59:01] 06Operations, 10ops-codfw, 10fundraising-tech-ops: payments2002 disk failure - https://phabricator.wikimedia.org/T149646#2782959 (10Papaul) a:05Papaul>03Jgreen Disk replacement complete [18:00:09] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures [18:00:09] RECOVERY - check_puppetrun on indium is OK: OK: Puppet is currently enabled, last run 256 seconds ago with 0 failures [18:00:09] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 241 seconds ago with 0 failures [18:00:10] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:00:48] bleh [18:00:54] heh [18:00:55] i just set to silence for 30 and it clears, hahaha [18:02:21] (03PS1) 10Filippo Giunchedi: Update debian/changelog to 0.1.29 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/320624 [18:02:41] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782978 (10GWicke) [18:02:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6461/IPv6: Active, AS6461/IPv4: Active [18:03:06] gilles: ^ that should trigger jenkins to DTRT [18:03:41] (03PS3) 10Filippo Giunchedi: Set environment variables for ImageMagick running inside Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/319807 (https://phabricator.wikimedia.org/T149985) (owner: 10Gilles) [18:04:09] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:04:37] godog: ah, yeah, forgot the changelog, my bad [18:04:49] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 17, down: 0, shutdown: 0 [18:05:01] (03CR) 10Gilles: [C: 032] Update debian/changelog to 0.1.29 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/320624 (owner: 10Filippo Giunchedi) [18:05:13] np, still doesn't work on jenkins anyways since backports are not included [18:06:19] PROBLEM - MariaDB Slave Lag: s1 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2811.72 seconds [18:06:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:07:03] (03CR) 10Filippo Giunchedi: [C: 032] Set environment variables for ImageMagick running inside Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/319807 (https://phabricator.wikimedia.org/T149985) (owner: 10Gilles) [18:07:23] ^that is me [18:07:36] it expired, I thought it was going to take less [18:07:36] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2782999 (10Gilles) I'm sorry but the pros and cons you've entered are completely biased and focus on very subjective properties. 
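The "silence it for the next 30 minutes" step discussed above maps onto Icinga's external command interface: append a SCHEDULE_SVC_DOWNTIME line to the command file and the daemon picks it up. A rough sketch; the command-file path, host, service, and author values are assumptions for illustration:

    # Schedule 30 minutes of downtime for a flapping service check.
    # /var/lib/icinga/rw/icinga.cmd is a common Debian path, assumed here.
    import time

    now = int(time.time())
    fields = [
        "[%d] SCHEDULE_SVC_DOWNTIME" % now,
        "lutetium",            # host name (example from the alerts above)
        "check_puppetrun",     # service description
        str(now),              # start time
        str(now + 1800),       # end time
        "1",                   # fixed (not flexible) downtime
        "0",                   # no triggering downtime id
        "1800",                # duration in seconds
        "ops",                 # author (placeholder)
        "frack puppet failure is known",  # comment
    ]
    with open("/var/lib/icinga/rw/icinga.cmd", "a") as f:
        f.write(";".join(fields) + "\n")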
[18:07:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:08:18] ah ha [18:09:45] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2783005 (10Gilles) "Easy to select size and thumb type." How is it not easy in the second example? What constitutes "mis-use"? [18:13:07] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2783013 (10Cmjohnson) @fgiunchedi replaced, both backplanes, system board and rear ssds. I am able to pxe boot now. I will leave this open. If all goes well please resolve this task...... [18:13:08] (03CR) 10Filippo Giunchedi: [C: 032] Update debian/changelog to 0.1.29 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/320624 (owner: 10Filippo Giunchedi) [18:17:05] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2783026 (10Gilles) I would once again recommend that you write code that would parse, generate and cover a few known use cases for all considered schem... [18:17:34] (03PS5) 10Dzahn: remove neon from puppet/netboot/dhcp/network [puppet] - 10https://gerrit.wikimedia.org/r/318437 (https://phabricator.wikimedia.org/T125023) [18:21:19] RECOVERY - MariaDB Slave Lag: s1 on db2042 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [18:21:57] (03PS1) 10Andrew Bogott: wikistatus: Further attempts to get reliable page edits [puppet] - 10https://gerrit.wikimedia.org/r/320626 [18:22:51] (03PS1) 10Yurik: Enable on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320627 (https://phabricator.wikimedia.org/T138057) [18:23:51] (03CR) 10Andrew Bogott: [C: 032] wikistatus: Further attempts to get reliable page edits [puppet] - 10https://gerrit.wikimedia.org/r/320626 (owner: 10Andrew Bogott) [18:25:49] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:26:07] (03CR) 10Dzahn: [C: 032] remove neon from puppet/netboot/dhcp/network [puppet] - 10https://gerrit.wikimedia.org/r/318437 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [18:26:12] (03PS6) 10Dzahn: remove neon from puppet/netboot/dhcp/network [puppet] - 10https://gerrit.wikimedia.org/r/318437 (https://phabricator.wikimedia.org/T125023) [18:26:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:29:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:29:43] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf Debian package to apt.w.o - https://phabricator.wikimedia.org/T150304#2783048 (10fgiunchedi) @Eevans I can take care of that, though I'm not seeing the debian branch on the gerrit repo yet? 
[18:30:27] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2783050 (10Gilles) [18:30:30] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Set IM Thumbor engine environment variables - https://phabricator.wikimedia.org/T149985#2783049 (10Gilles) 05Open>03Resolved [18:33:29] !log deploy python-thumbor-wikimedia 0.1.29 to thumbor100[12] [18:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:42] !log neon (formerly icinga) remove from puppet, revoke cert, delete salt key, stop icinga service ... [18:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:26] !log neon stopping nsca and apache [18:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:50] !log partitioning db2042- it will have temporarily lag for 10-20 hours [18:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:35] mutante: \o/ [18:38:55] (03PS4) 10Filippo Giunchedi: Rotate Thumbor 404 log by size, not date [puppet] - 10https://gerrit.wikimedia.org/r/320273 (https://phabricator.wikimedia.org/T150208) (owner: 10Gilles) [18:39:04] (03PS3) 10BBlack: remove debian perl ldflags patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320161 [18:39:06] (03PS3) 10BBlack: depend on lsb-base >= 3.0-6 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320162 [18:39:08] (03PS2) 10BBlack: remove stapling_proxy patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320113 [18:39:10] (03PS6) 10BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 [18:39:12] (03PS4) 10BBlack: add stapling-multi-file patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320115 [18:39:14] (03PS2) 10BBlack: remove readahead patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320114 [18:39:16] (03PS1) 10BBlack: new variant for ECDHE curve logging [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320629 [18:42:59] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:43:17] (03CR) 10Filippo Giunchedi: [C: 032] Rotate Thumbor 404 log by size, not date [puppet] - 10https://gerrit.wikimedia.org/r/320273 (https://phabricator.wikimedia.org/T150208) (owner: 10Gilles) [18:43:19] (03PS5) 1020after4: `scap patch` tool for applying patches to a wmf/branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 [18:43:59] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.000156 secs [18:46:00] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:46:43] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf Debian package to apt.w.o - https://phabricator.wikimedia.org/T150304#2783093 (10Eevans) >>! In T150304#2783048, @fgiunchedi wrote: > @Eevans I can take care of that, though I'm not seeing the debian branch on the gerrit repo yet? Tha... 
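The ImageMagick change merged and marked resolved above (and reverted later in this log, at 19:38) relies on ImageMagick reading resource limits from the environment, MAGICK_TIME_LIMIT among them. A sketch of the general technique; the convert invocation and file names are placeholders, and 60 is the value named in the revert:

    # Cap how long an ImageMagick 'convert' may run by exporting
    # MAGICK_TIME_LIMIT (seconds) into the child's environment.
    import os
    import subprocess

    env = dict(os.environ, MAGICK_TIME_LIMIT="60")
    subprocess.run(
        ["convert", "input.jpg", "-thumbnail", "220x", "output.jpg"],
        env=env,
        check=True,  # raises if convert fails, e.g. when the time limit trips
    )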
[18:47:29] !log Stopping MySQL dbstore2001 - taking a snapshot - T149457 [18:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:35] T149457: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457 [18:47:59] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 0.000794 secs [18:48:15] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2783095 (10Anomie) > and (generally) avoid specifying default values explicitly. (Exception might be the page parameter.) Maybe you didn't get it afte... [18:50:18] (03CR) 10Thcipriani: [C: 031] "Few nitpicks, looks good overall" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [18:54:19] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:56:19] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset 0.002356 secs [18:57:10] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [18:58:09] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset -0.000929 secs [18:58:30] feels like icinga-wm's rate limit kicks in or so [18:58:51] there are more of these in web ui, but it reports only a few selected ones [18:59:19] they are all fixing themselves eventually [18:59:29] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf Debian package to apt.w.o - https://phabricator.wikimedia.org/T150304#2783143 (10fgiunchedi) >>! In T150304#2783093, @Eevans wrote: > That's odd; It shows up in diffusion [[ https://phabricator.wikimedia.org/diffusion/ODCTW/branches/m... [18:59:55] it's because NTP config changed on all because neon was removed [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161109T1900). Please do the needful. [19:00:05] yurik and maxsem: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:10] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [19:01:09] RECOVERY - NTP peers on hydrogen is OK: NTP OK: Offset 0.004642 secs [19:03:09] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [19:03:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:05:04] (03PS6) 1020after4: `scap patch` tool for applying patches to a wmf/branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 [19:05:10] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 0.001338 secs [19:05:49] (03CR) 1020after4: "Nits picked and logging improved." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [19:06:04] thcipriani|afk: ^ [19:07:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:08:15] my apologies, seems like my IRC client froze. Is anyone swating? [19:08:33] jouncebot: refresh [19:08:36] I refreshed my knowledge about deployments. [19:08:38] can someone who knows more about such things check into MW fatals alert above? 
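The NTP peer alerts flapping through this stretch are expected noise, as noted above: the NTP configuration changed everywhere when neon was retired. The check itself reduces to asking a peer for its clock offset; a rough Python approximation using the third-party ntplib package (this is not the actual monitoring plugin, the FQDN is assumed, and the 50 ms threshold is made up):

    # Query one NTP server and report its offset, as in
    # "NTP OK: Offset 0.002356 secs" above.
    import ntplib  # third-party: pip install ntplib

    response = ntplib.NTPClient().request("maerlant.wikimedia.org", version=3)
    status = "OK" if abs(response.offset) < 0.05 else "CRITICAL"
    print("%s: Offset %.6f secs" % (status, response.offset))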
[19:08:39] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2783162 (10Gehel) [19:08:39] jouncebot: next [19:08:40] In 0 hour(s) and 51 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161109T2000) [19:08:55] (03PS1) 10Filippo Giunchedi: mtail: comment temporarily EDAC fine grained reporting to avoid barfing [puppet] - 10https://gerrit.wikimedia.org/r/320631 [19:09:01] there's a user in -tech reporting a strange 503 when editing logged in, that repeated a bit [19:09:06] oic morning swat now [19:09:07] apergos, do you know who's running the swat? [19:09:26] yurik, I do not [19:09:45] bblack I can look at fluorine and see if anything interesting is reported [19:09:49] not much beyond that though [19:10:26] my logstash-foo is weak, but that might be something too [19:11:26] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2783168 (10GWicke) > Maybe you didn't get it after all. There's nothing special about the "page" parameter, you could have the same problem with "t=0s"... [19:12:06] !log upload cassandra-tools-wmf 1.0.0-1 to jessie-wikimedia on carbon - T150304 [19:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:14] T150304: Upload cassandra-tools-wmf Debian package to apt.w.o - https://phabricator.wikimedia.org/T150304 [19:13:11] (03CR) 10Filippo Giunchedi: [C: 032] mtail: comment temporarily EDAC fine grained reporting to avoid barfing [puppet] - 10https://gerrit.wikimedia.org/r/320631 (owner: 10Filippo Giunchedi) [19:16:20] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: delete unused kartotherian marke metrics - https://phabricator.wikimedia.org/T150353#2783175 (10Gehel) [19:19:00] (03PS1) 10Jcrespo: Revert "Depool db2048 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320633 [19:19:24] thcipriani|afk, Reedy, do you know whos' doing the swat today? [19:19:40] Dereckson? [19:20:11] yurik: yes, I can SWAT in five minutes [19:20:18] thx! [19:23:55] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2783210 (10GWicke) [19:25:19] 06Operations, 06Maps, 03Interactive-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2783219 (10Gehel) Better metrics / dashboard is required to have visibility on what is happening. 
[19:26:19] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: delete unused kartotherian marke metrics - https://phabricator.wikimedia.org/T150353#2783220 (10Gehel) p:05Triage>03Normal [19:27:14] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: delete unused kartotherian marker metrics - https://phabricator.wikimedia.org/T150353#2783223 (10MaxSem) [19:27:34] 06Operations, 06Discovery, 06Maps, 07Epic, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2783224 (10Gehel) [19:29:26] 06Operations, 06Discovery, 06Maps, 07Epic, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#2767903 (10Gehel) p:05Triage>03High [19:31:23] (03CR) 10Dereckson: [C: 032] Enable on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320627 (https://phabricator.wikimedia.org/T138057) (owner: 10Yurik) [19:32:08] (03Merged) 10jenkins-bot: Enable on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320627 (https://phabricator.wikimedia.org/T138057) (owner: 10Yurik) [19:33:07] yurik: live on mw1099 [19:37:31] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2748980 (10Yurik) [19:38:20] (03PS1) 10Filippo Giunchedi: thumbor: revert MAGICK_TIME_LIMIT=60 [puppet] - 10https://gerrit.wikimedia.org/r/320636 (https://phabricator.wikimedia.org/T149985) [19:38:55] Dereckson, thx, checking [19:39:41] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf Debian package to apt.w.o - https://phabricator.wikimedia.org/T150304#2783277 (10Eevans) >>! In T150304#2783143, @fgiunchedi wrote: > I take it we're going to need the package available only for jessie internally? I'm not sure what ot... [19:39:41] Dereckson, looks good [19:40:09] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: revert MAGICK_TIME_LIMIT=60 [puppet] - 10https://gerrit.wikimedia.org/r/320636 (https://phabricator.wikimedia.org/T149985) (owner: 10Filippo Giunchedi) [19:40:11] Dereckson: Can I add https://gerrit.wikimedia.org/r/#/c/320555/ to the swat still? [19:42:13] (03CR) 10Alex Monk: "Because we're adding _letsencrypt, the difference should be explicit" [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [19:42:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:42:57] Dereckson: Nevermind. I'm blind. It's already in this weeks' train. [19:44:44] (03PS4) 10Alex Monk: Split check_ssl between traditional year-long certs and LE's 3 month certs [puppet] - 10https://gerrit.wikimedia.org/r/313805 (https://phabricator.wikimedia.org/T144293) [19:48:54] Krinkle: ack'ed [19:49:08] 06Operations, 10Analytics, 10ChangeProp, 10Citoid, and 11 others: Node 6 upgrade planning - https://phabricator.wikimedia.org/T149331#2783313 (10GWicke) @MoritzMuehlenhoff, thanks a lot for preparing the node debs! [19:49:19] Dereckson, synced? 
[19:49:58] yurik: I was watching logs, they look good too, I'm syncing [19:50:05] thx :) [19:51:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:51:35] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable on ruwiki (T138057) (duration: 00m 48s) [19:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:41] T138057: Epic: Enable on Wikipedia - https://phabricator.wikimedia.org/T138057 [19:56:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:57:03] on fatalmonitor: [19:57:04] 142 pl proc line: 2959: warning: points must have either 4 or 2 values per line [19:57:10] 61 Fatal error: unknown exception [19:57:32] first is ploticus like usual, second is worrying [19:58:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161109T2000). [20:01:10] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Wikidata Query Service is overly verbose toward logstash - https://phabricator.wikimedia.org/T150356#2783356 (10Gehel) [20:02:33] (03PS1) 1020after4: group1 wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320645 [20:02:35] (03CR) 1020after4: [C: 032] group1 wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320645 (owner: 1020after4) [20:03:07] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320645 (owner: 1020after4) [20:03:14] twentyafterfour: check perhaps before to deploy the source of this: 19:57:05 < Dereckson> 61 Fatal error: unknown exception [20:04:05] Dereckson: that error is a bit vague [20:04:36] "a bit" [20:05:18] !log statsv->graphite has been down for 9 hours since roughly 10AM UTC [20:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:31] !log Killed statsv.py process on hafnium. Seems to have fixed it. [20:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:45] it rebooted [20:06:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [20:07:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:08:28] twentyafterfour: I checked hhvm.log, nothing else afterwards is logged :/ [20:08:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:13:05] !log created missing wikilove_log tables on azwiki and labtestwiki - T150321 [20:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:11] T150321: Table 'azwiki.wikilove_log' doesn't exist - https://phabricator.wikimedia.org/T150321 [20:15:29] Krenair: hafnium rebooted? [20:15:55] godog, hmm? 
[20:16:07] I don't have access to that host let alone the ability to reboot it [20:16:58] (03PS6) 10Dzahn: tcpircbot: improve firewall rule setup [puppet] - 10https://gerrit.wikimedia.org/r/316497 [20:17:41] Krenair: sigh, that was meant from Krinkle [20:17:59] (03CR) 10Dzahn: tcpircbot: improve firewall rule setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [20:18:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:18:40] 06Operations, 10Analytics-General-or-Unknown, 10Graphite, 06Performance-Team, 07Wikimedia-Incident: statsv outage on 2016-11-09 - https://phabricator.wikimedia.org/T150359#2783449 (10Krinkle) [20:18:41] godog: https://phabricator.wikimedia.org/T150359 [20:19:10] 06Operations, 10Analytics-General-or-Unknown, 10Graphite, 06Performance-Team, 07Wikimedia-Incident: statsv outage on 2016-11-09 - https://phabricator.wikimedia.org/T150359#2783461 (10Krinkle) [20:19:10] (03PS1) 10Catrope: Add Flow test namespace to all labs wikis that have Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320647 [20:19:11] (03PS1) 10Catrope: Enable Flow beta feature on hewiki in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320648 [20:19:21] godog: I've added hafnium reboot to the SAL except [20:19:23] good point [20:19:25] I didn't see that [20:22:04] 06Operations, 10Analytics-General-or-Unknown, 10Graphite, 06Performance-Team, 07Wikimedia-Incident: statsv outage on 2016-11-09 - https://phabricator.wikimedia.org/T150359#2783470 (10Krinkle) [20:22:13] elukey: [20:22:41] Krinkle: heh still odd that at least puppet didn't bring it back up [20:22:59] godog: The process was running all this time. [20:23:10] Or at least some process was running. [20:23:21] but it didn't work and left no syslog entry either [20:23:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:23:33] it crashed somehow, but didn't kill the process [20:23:39] i had to kill it just now [20:24:04] Interesting that hafnium rebooted an hour after the incident [20:24:08] You'd think that would bring it back up at least [20:24:47] indeed, also I misread the log I thought the traceback meant the process as a whole exited [20:25:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1385 [20:25:40] godog: What does it mean when ps aux reports a time higher than 24:00? [20:25:43] Nov08 43:51 [20:25:48] Nov08 57:58 [20:26:26] 06Operations, 10Cassandra, 06Services (doing): Upload cassandra-tools-wmf Debian package to apt.w.o - https://phabricator.wikimedia.org/T150304#2783480 (10Eevans) 05Open>03Resolved a:05Eevans>03fgiunchedi [20:26:34] Oh, right. tat's not the time [20:26:35] lol [20:26:36] that's uptime [20:26:54] ehh almost, the accumulated cpu time so far [20:27:06] yeah [20:28:26] also statsv doesn't clean up after itself? it has some zombies [20:29:20] 06Operations, 10Analytics-General-or-Unknown, 10Graphite, 06Performance-Team, 07Wikimedia-Incident: statsv outage on 2016-11-09 - https://phabricator.wikimedia.org/T150359#2783488 (10Krinkle) [20:29:48] godog: Where? 
[20:30:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1685 [20:32:01] !log T133395: Converting next 25 RESTBase tables to time-window compaction [20:32:06] (03Abandoned) 10BBlack: Bugfix for ECDHE curve logging [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319775 (owner: 10BBlack) [20:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:08] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [20:32:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:33:54] Krinkle: I've seen it via ps fwwwaux [20:33:55] nobody 12665 4.9 0.3 902184 111576 ? Ssl 20:01 1:23 /usr/bin/python /srv/deployment/statsv/statsv/statsv.py [20:33:59] nobody 12675 0.0 0.0 0 0 ? Z 20:01 0:01 \_ [python] [20:35:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1985 [20:36:30] godog: Hm.. that may've been be being too impulsive trying to fix it without reading the manual first. [20:36:35] be*me [20:36:47] In fact, I'm sure of it. [20:36:53] because it's down again [20:37:50] (03PS2) 10BBlack: new variant for ECDHE curve logging [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320629 [20:37:53] (03PS7) 10BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 [20:37:54] (03PS5) 10BBlack: add stapling-multi-file patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320115 [20:38:04] Krinkle: it did this http://www.reactiongifs.com/r/gtfafm1.gif [20:39:19] godog: I killed the one process I saw and expected to see and then multiple started. I thought I did something wrong and stopped the duplicates. I'm regretting that now. I knew it used multiple threads, but didn't expect to see multiple processes. [20:39:24] Maybe it is supposed to though [20:39:56] kill one and get multiple back, usually seems wrong. [20:40:03] yeah it is, it uses queue from multiprocess afaics [20:40:09] RECOVERY - check_mysql on lutetium is OK: Uptime: 182209 Threads: 5 Questions: 9019612 Slow queries: 1014 Opens: 656009 Flush tables: 2 Open tables: 64 Queries per second avg: 49.501 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [20:40:12] multiprocess module that is [20:40:29] OK. Good. I'll let it restore to normal. [20:40:30] Fatal error: unknown exception ...is a mystery [20:47:34] twentyafterfour: ^ what's that in reference to? [20:48:49] bblack: random fatals in logstash / syslog on fluorine (hhvm errors) [20:49:09] * apergos lurks [20:49:34] twentyafterfour: we also had some reports about 503s, from 2x users (US/CA user hitting enwiki via ulsfo, EU user hitting dewiki via esams) who were logged-in and trying to submit edits [20:49:47] https://phabricator.wikimedia.org/P4402 [20:49:53] I *think* those 503s were due to an MW-level fatal that aborted the connection and thus became 503 instead of 500 [20:50:15] 1-2h ago [20:51:09] it's happening pretty regularly [20:51:32] one came back and re-reported a bit later, but then poofed [20:52:23] (03PS2) 10Filippo Giunchedi: role: use single ferm rule for prometheus node/varnish exporter [puppet] - 10https://gerrit.wikimedia.org/r/320622 [20:53:01] twentyafterfour: where did you look to find those? 
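The statsv exchange above is textbook multiprocessing behaviour: the parent owns worker processes fed through a Queue, a worker that exits lingers as a zombie ("Z ... \_ [python]" in ps fwwwaux) until the parent reaps it, and killing processes piecemeal leaves the group in a confused state. A self-contained sketch of that layout (not statsv's actual code):

    # Parent/worker pair connected by a multiprocessing.Queue. Between the
    # worker's exit and the parent's join(), ps shows the child as a zombie.
    import multiprocessing
    import time

    def worker(queue):
        while True:
            item = queue.get()
            if item is None:      # sentinel: shut down cleanly
                return
            print("forwarding", item)

    if __name__ == "__main__":
        queue = multiprocessing.Queue()
        child = multiprocessing.Process(target=worker, args=(queue,))
        child.start()
        queue.put("search.loadtime:123|ms")  # made-up statsd-style sample
        queue.put(None)
        time.sleep(10)  # during this window the exited child is the Z entry
        child.join()    # reaping the child clears it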
[20:53:19] logstash I think [20:53:21] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2699497 (10JoeWalsh) `Wikipedia/942 CFNetwork/808.0.2 Darwin/16.0.0` is a user agent passed by the iOS app's WebView (which displays the article HTML). In some cases, the iOS app alter... [20:53:28] apergos: logstash / hhvm.log [20:53:38] (03CR) 10jenkins-bot: [V: 04-1] role: use single ferm rule for prometheus node/varnish exporter [puppet] - 10https://gerrit.wikimedia.org/r/320622 (owner: 10Filippo Giunchedi) [20:57:38] ok, I have been linked the kibana url for those for the future (thanks!) [20:58:33] (03PS3) 10Filippo Giunchedi: role: use single ferm rule for prometheus node/varnish exporter [puppet] - 10https://gerrit.wikimedia.org/r/320622 [20:58:56] I'm going ahead with the train. the fatal errors are happening on wmf.1 not wmf.2 [20:59:36] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.2 [20:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161109T2100). Please do the needful. [21:02:07] Now this is showing up: CAS update failed on user_touched for user ID '55' (read from replica); the version of the user to be saved is older than the current version. [21:02:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:02:41] no parsoid deploy [21:04:23] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2783750 (10JoeWalsh) I made a change to prevent the iOS app from requesting zero width thumbnails. Should go out with version 5.3.0 [21:04:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [21:06:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:07:42] (03PS3) 10BBlack: new variant for ECDHE curve logging [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320629 (https://phabricator.wikimedia.org/T144523) [21:07:44] (03PS4) 10BBlack: remove debian perl ldflags patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320161 [21:07:46] (03PS4) 10BBlack: depend on lsb-base >= 3.0-6 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320162 [21:07:48] (03PS3) 10BBlack: remove stapling_proxy patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320113 (https://phabricator.wikimedia.org/T93927) [21:07:50] (03PS8) 10BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776 [21:07:52] (03PS6) 10BBlack: add stapling-multi-file patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320115 (https://phabricator.wikimedia.org/T93927) [21:07:54] (03PS3) 10BBlack: remove readahead patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320114 (https://phabricator.wikimedia.org/T148917) [21:09:13] javascript issues on Wikidata. [21:09:15] "TypeError: mw.popups.removeTooltips is not a function. 
[21:02:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:02:41] no parsoid deploy
[21:04:23] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2783750 (10JoeWalsh) I made a change to prevent the iOS app from requesting zero-width thumbnails. Should go out with version 5.3.0
[21:04:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[21:06:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[21:07:42] (03PS3) 10BBlack: new variant for ECDHE curve logging [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320629 (https://phabricator.wikimedia.org/T144523)
[21:07:44] (03PS4) 10BBlack: remove debian perl ldflags patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320161
[21:07:46] (03PS4) 10BBlack: depend on lsb-base >= 3.0-6 [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320162
[21:07:48] (03PS3) 10BBlack: remove stapling_proxy patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320113 (https://phabricator.wikimedia.org/T93927)
[21:07:50] (03PS8) 10BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/319776
[21:07:52] (03PS6) 10BBlack: add stapling-multi-file patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320115 (https://phabricator.wikimedia.org/T93927)
[21:07:54] (03PS3) 10BBlack: remove readahead patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/320114 (https://phabricator.wikimedia.org/T148917)
[21:09:13] javascript issues on Wikidata.
[21:09:15] "TypeError: mw.popups.removeTooltips is not a function. (In 'mw.popups.removeTooltips($elements)', 'mw.popups.removeTooltips' is undefined)"
[21:10:00] tons of "Redis exception connecting to "rdb1005.eqiad.wmnet:6379""
[21:10:08] sjoerddebruin: thank you!
[21:10:36] sjoerddebruin: where do you see that?
[21:11:11] twentyafterfour: well, literally every Wikidata page
[21:11:19] sjoerddebruin: I am not seeing it
[21:12:23] Weird, do you have the hovercards beta feature enabled?
[21:12:38] no, but I just enabled all beta features just for testing
[21:12:42] (as that is what the error refers to, I think)
[21:12:47] yep, that triggered it
[21:13:11] https://gerrit.wikimedia.org/r/#/c/316978/ is probably to blame
[21:14:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:14:59] !log starting mobileapps deploy
[21:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:57] jhobs: please see my comments here ^
[21:18:08] !log deployed mobileapps 106f4cd
[21:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:20] twentyafterfour: what to do? this is hitting roughly 1,483 users on Wikidata alone
[21:20:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1321
[21:20:28] sjoerddebruin: I'm not sure, we either try to fix it or revert the patch
[21:21:35] concern raised on https://phabricator.wikimedia.org/rEPOP0ff40a6532b57001b753d29c47b7477d905a9a39
[21:21:54] I did the same on Gerrit
[21:22:11] Not sure what the preferred method is, but I did everything I could.
[21:22:25] best to file a task
[21:22:31] no one reads comments on commits on phab
[21:22:40] and people often don't read comments on gerrit, even though they should
[21:23:16] Pff, I hope we have it all integrated into one system soon....
[21:24:50] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2783834 (10BBlack) @JoeWalsh - sounds great! can we get a link to the change?
[21:25:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1621
[21:26:16] 06Operations, 10netops: Low IPv6 bandwidth from Free.fr (AS12322) > Zayo > eqiad - https://phabricator.wikimedia.org/T150374#2783846 (10hashar)
[21:27:34] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2783861 (10GWicke)
[21:30:06] sjoerddebruin: if in doubt, file a task. That is my motto :]
[21:30:09] RECOVERY - check_mysql on lutetium is OK: Uptime: 185209 Threads: 3 Questions: 9191916 Slow queries: 1048 Opens: 665829 Flush tables: 2 Open tables: 64 Queries per second avg: 49.629 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[21:30:10] be advised, any moment now we should be deploying a new patch to grrrit-wm. Just so you guys are aware
[21:31:06] hashar: well, I'm hoping leaving a message on the current one will ping the relevant people.
[21:31:11] grrrit-wm-test: help
[21:31:19] grrrit-wm: help
[21:31:30] grrrit-wm: is now in normal operation, have a great day!
[21:31:39] My current commands are: grrrit-wm: restart, grrrit-wm: force-restart, and grrrit-wm: nick
[21:31:41] grrrit-wm: help
[21:31:43] My current commands are: grrrit-wm: restart, grrrit-wm: force-restart, and grrrit-wm: nick
[21:34:00] sjoerddebruin: hmm, that issue was discovered with the patch and fixed... I think there may be a caching problem or something
[21:34:35] Local or server?
[21:34:38] Unfortunately I'm nowhere near my workstation and won't be until tomorrow (I get IRC pings on mobile) :/
[21:35:10] Try pinging bmansurov, he may be able to assist
[21:35:50] Not sure which, on the caching
[21:35:59] If that is indeed to blame
[21:35:59] Well, you already mentioned his name so hopefully he sees it.
[21:36:33] He probably checks #wikimedia-mobile more often, if you don't get a response
[21:37:28] But yeah, I remember that error being an issue and fixed
[21:37:43] Well, it's not only an error, it's also blocking the JavaScript loading
[21:38:01] Which means either it didn't hit the same branch, or some weird caching, probably
[21:42:16] Sorry I can't be of more help from my phone, sjoerddebruin :( if worse comes to worst, you could disable the beta feature and/or extension and SWAT it
[21:43:06] A rollback is too much?
[21:46:37] seems like cache, because I can't find a reference to the old identifier in any code
[21:47:09] (03PS2) 10EBernhardson: [cirrus] Rename CirrusSearchMoreLikeThisCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313037
[21:47:25] (03PS2) 10EBernhardson: [cirrus] Remove deprecated per-user poolcounter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/313211
[21:48:14] So what now?
[21:53:17] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2741596 (10RobH)
[21:56:55] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2783986 (10Zppix) I have no clue why this isn't marked resolved
[21:57:03] (03PS1) 10Ottomata: Deploy EventStreams on scb and configure LVS service in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925)
[21:57:11] 06Operations, 10Gerrit, 10grrrit-wm, 13Patch-For-Review: Support restarting grrrit-wm automatically when we restart production gerrit - https://phabricator.wikimedia.org/T149609#2783989 (10Zppix) 05Open>03Resolved
[22:03:24] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2784005 (10RobH) a:05RobH>03Gehel Some questions regarding elastic search specifications: It should be note...
[22:03:37] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2784008 (10Andrew) 05Open>03Resolved we've gone > a week without leaks. Seems unrealistic to close this, but ther...
[22:04:17] the error doesn't seem to be breaking anything
[22:04:22] (on wikidata)
[22:11:30] None of the JavaScript features work
[22:11:36] Including adding or editing statements.
[22:11:52] Could be my browser, Safari.
[22:11:53] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2784022 (10RobH) I neglected to ask: * If the increased core count leads to higher utilization on the new syste...
[22:12:13] sjoerddebruin: hey, what's the issue?
[22:12:43] bmansurov: see https://phabricator.wikimedia.org/T142723#2783832
[22:12:48] ok
[22:14:21] (03CR) 10MaxSem: Switch discovery-stats cronjob to a dedicated script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/319252 (https://phabricator.wikimedia.org/T149722) (owner: 10MaxSem)
[22:18:59] (03PS1) 10BBlack: Revert "cp1008: disable do_ocsp_int while experimenting with nginx packages" [puppet] - 10https://gerrit.wikimedia.org/r/320696 (https://phabricator.wikimedia.org/T93927)
[22:19:01] (03PS1) 10BBlack: Revert "tlsproxy: experimental support for internal ocsp" [puppet] - 10https://gerrit.wikimedia.org/r/320697 (https://phabricator.wikimedia.org/T93927)
[22:19:55] sjoerddebruin: I see the codebase is up to date and the change that fixes the issue was merged on November 3, which should be in production by now.
[22:20:09] sjoerddebruin: also, when I append ?debug=true to the URL I don't see the error anymore.
[22:21:23] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review, 07Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2784034 (10hashar) Good news @Andrew, thank you.
[22:21:53] bmansurov: it still shows up here when I do that
[22:22:09] hmm, link?
[22:22:32] I just tried a random item. https://www.wikidata.org/wiki/Q5554728?debug=true
[22:22:37] thanks
[22:23:11] sjoerddebruin: did you say you're using Safari?
[22:23:16] Yup
[22:23:59] (03CR) 10BBlack: [C: 032] Revert "cp1008: disable do_ocsp_int while experimenting with nginx packages" [puppet] - 10https://gerrit.wikimedia.org/r/320696 (https://phabricator.wikimedia.org/T93927) (owner: 10BBlack)
[22:24:02] (03CR) 10BBlack: [C: 032] Revert "tlsproxy: experimental support for internal ocsp" [puppet] - 10https://gerrit.wikimedia.org/r/320697 (https://phabricator.wikimedia.org/T93927) (owner: 10BBlack)
[22:24:42] Firefox doesn't block the other JavaScript, but still no "add statement" buttons
[22:24:50] quick question all - looking at a user's contributions, a capital F has appeared between the article name and the edit tag, such as (+12)‎ . . Joke ‎ (F) (Tags: Mobile edit, Mobile web edit). What's it for?
[22:25:09] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3378
[22:25:33] sjoerddebruin: I use Firefox, and all the default buttons on Wikidata work fine for me
[22:25:46] sjoerddebruin: can you open up https://www.wikidata.org/w/load.php?modules=ext.popups.core and look for "removeTooltips"
[22:25:47] Zppix: do you have the hovercards beta feature enabled?
[22:26:00] bmansurov: none found
[22:26:08] sjoerddebruin: I don't use any beta features within Wikidata; the only one I do use is ORES on enwiki
[22:26:17] sjoerddebruin: an old file is being served then
[22:26:32] Can we force a new one?
[22:26:36] bmansurov: let me attempt to try it
[22:27:11] sjoerddebruin: I don't know how, someone else here should know
[22:27:34] bmansurov: I tried it but all I get is a big wall of code
[22:27:34] Zppix: try to force a new file?
[22:27:43] no, I meant access that link
[22:28:19] Zppix: yes, I also get that
[22:28:56] Zppix: try https://www.wikidata.org/w/load.php?modules=ext.popups.core&debug=true&lang=en&only=scripts&skin=vector
[22:29:03] same thing, not minimized
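bmansurov's check (fetch the module from load.php and look for the symbol) is easy to script. A sketch using Python's third-party requests library, with the URL taken from the log; everything else here is illustrative:

    # Fetch a ResourceLoader module as served and check whether it still
    # contains a given symbol; "False" with the fix already merged points
    # at a stale cached copy of the module. Requires `pip install requests`.
    import requests

    MODULE_URL = ('https://www.wikidata.org/w/load.php'
                  '?modules=ext.popups.core&debug=true&lang=en'
                  '&only=scripts&skin=vector')

    def module_contains(symbol):
        resp = requests.get(MODULE_URL, timeout=10)
        resp.raise_for_status()
        return symbol in resp.text

    print(module_contains('removeTooltips'))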
[22:30:04] I'd like to update https://en.wikipedia.org/wiki/Wikipedia#Hardware_operations_and_support, the number of servers and the use of Squid for front-end caching are incorrect. I'm not sure about the number of requests, though, are there any statistics on that?
[22:30:09] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3678
[22:30:52] (asking here because there's probably something in grafana about it? if not, I'll ask elsewhere)
[22:31:07] thcipriani: do you know why wikidata may be loading an old version of a JS module?
[22:31:57] bmansurov: considering wikidata is "special", I think it may use older modules due to how wikidata functions or such, at least that's what I'm assuming (and it seems plausible to me)
[22:32:34] Zppix: ok, let me know if I can help any further
[22:32:52] I can look at the deployment server, but I don't have any answer offhand
[22:33:40] ok thanks
[22:34:07] SPF|Cloud, there is https://grafana.wikimedia.org/dashboard/db/varnish-traffic in grafana
[22:34:58] Zppix: sjoerddebruin: https://www.wikidata.org/w/load.php?modules=ext.popups.targets.desktopTarget&debug=true&lang=en&only=scripts&skin=vector is the correct file
[22:35:06] which is causing the issue
[22:35:09] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 92268 Threads: 1 Questions: 3975124 Slow queries: 684 Opens: 939 Flush tables: 1 Open tables: 581 Queries per second avg: 43.082 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[22:35:44] but that one is also up to date
[22:35:47] when accessed directly
[22:36:29] Krenair: yeah, I saw that as well, but it's just traffic in Gbps
[22:37:07] Still don't see others with the same problem. :(
[22:37:33] hrm, so it seems that Wikidata was checked out to the correct commit on the deployment hosts. https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json#L172 is the correct branch, and the current commit on the deployment host is the head of that branch: https://github.com/wikimedia/mediawiki-extensions-Wikidata/tree/wmf/1.28.0-wmf.23
[22:38:27] https://grafana.wikimedia.org/dashboard/db/varnish-caching?panelId=5&fullscreen&var-cluster=text&var-site=All this looks like something I'm searching for
[22:38:44] SPF|Cloud, https://grafana.wikimedia.org/dashboard/db/varnish-http-requests
[22:39:42] is that per minute?
[22:40:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1338
[22:40:12] bmansurov: (I forgot to tell you, since you're probably new to this channel, but I'm actually not part of the operations team; I'm just usually in here helping out when able/needed, because I'm a volunteer dev and I like keeping up with server statuses)
[22:41:00] good to know
[22:41:59] the text clusters of the two caching datacenters seem to have codfw as backend (for one caching DC since today), so I guess for the best report I'd wait a few days and then find something that excludes codfw?
[22:42:19] 06Operations, 13Patch-For-Review: mgmt hosts that exist but don't resolve to an IP - https://phabricator.wikimedia.org/T149875#2784085 (10Papaul) p:05Triage>03Normal
[22:42:34] best ask bblack
[22:43:27] will ask him tomorrow (unless he can quickly help me with an answer). Raw Request Rate - text @ eqiad + esams + ulsfo seems right to me, but perhaps I'm wrong
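For SPF|Cloud's "is that per minute?" question, Graphite's render endpoint can return the raw series as JSON, where the timestamp spacing shows the reporting resolution (whether a value is a per-second rate or a per-interval count still depends on how the metric is produced). A sketch; the target name below is hypothetical, real metric paths would come from the dashboard's panel definitions:

    # Query Graphite's render API for a series as JSON and print the step
    # between datapoints. The target path is a made-up example.
    import json
    import urllib.request

    URL = ('https://graphite.wikimedia.org/render/'
           '?target=varnish.eqiad.text.frontend.request.rate'   # hypothetical
           '&from=-1h&format=json')

    with urllib.request.urlopen(URL) as resp:
        series = json.load(resp)

    for s in series:
        # datapoints are [value, unix_timestamp] pairs
        stamps = [ts for value, ts in s['datapoints']]
        if len(stamps) >= 2:
            print(s['target'], 'step:', stamps[1] - stamps[0], 'seconds')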
[22:43:55] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2784086 (10Papaul)
[22:48:00] 06Operations, 10ops-codfw, 10EventBus, 10netops: kafka2003 switch port configuration - https://phabricator.wikimedia.org/T150380#2784090 (10Papaul)
[22:48:29] 06Operations, 10ops-codfw, 10EventBus: rack/setup kafka2003 - https://phabricator.wikimedia.org/T150340#2782678 (10Papaul)
[22:48:44] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2784105 (10EBernhardson) memory to core count isn't a big deal for our elasticsearch cluster, for the most part...
[22:49:00] SPF|Cloud, I read that part of the article you linked
[22:49:13] it's also worth mentioning that LVS sits in front of the caches (now varnish, not squid)
[22:49:31] yup
[22:49:41] and also where it says "mainly Ubuntu", I think it's mainly Debian now
[22:49:54] also correct
[22:50:05] and pmtpa/sdtpa are gone as well
[22:50:24] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, 06Discovery-Search (Current work): elasticsearch new servers (5x eqiad / 12x codfw) - https://phabricator.wikimedia.org/T149089#2784109 (10EBernhardson) a:05Gehel>03RobH
[22:55:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1463
[22:55:18] SPF|Cloud, if we're talking about the diagram, yes... TS is gone, mobile and testwiki are handled much more normally, secure is gone, nagios has been replaced (now icinga, behind an auth check), don't know what 'project' is, BZ is gone, blogs is third-party now, SVN is gone, IMAP I think is gone, don't know what scratch is
[22:56:04] (03PS4) 10Filippo Giunchedi: role: use single ferm rule for prometheus node/varnish exporter [puppet] - 10https://gerrit.wikimedia.org/r/320622
[22:56:05] databases doesn't account for external storage, which is actually what stores wikitext
[22:56:30] memc/redis/etc. aren't on there, labs isn't on there, image stuff has been replaced with swift
[22:57:20] (03CR) 10Filippo Giunchedi: [C: 032] role: use single ferm rule for prometheus node/varnish exporter [puppet] - 10https://gerrit.wikimedia.org/r/320622 (owner: 10Filippo Giunchedi)
[22:57:42] PDF stuff has been replaced and is backend
[22:57:51] probably needs a new diagram
[22:57:58] I think there's a newer one on wikitech somewhere
[23:00:14] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1763
[23:00:36] SPF|Cloud, https://wikitech.wikimedia.org/w/images/4/4d/Infrastructure_overview.png
[23:01:57] https://phabricator.wikimedia.org/T143892
[23:03:01] this doesn't show a lot of the misc services, video scalers are also missing
[23:03:08] databases doesn't have as much detail
[23:05:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1994
[23:07:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[23:08:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[23:10:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2294
[23:14:33] (03PS1) 10Dzahn: (WIP) create eggdrop module [puppet] - 10https://gerrit.wikimedia.org/r/320698
[23:15:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2594
[23:20:09] RECOVERY - check_mysql on lutetium is OK: Uptime: 191809 Threads: 5 Questions: 9415792 Slow queries: 1066 Opens: 682438 Flush tables: 2 Open tables: 64 Queries per second avg: 49.089 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[23:23:36] 06Operations, 10Cassandra, 06Services, 13Patch-For-Review: Change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2784203 (10fgiunchedi) Looks like it is working on xenon: https://graphite.wikimedia.org/render/?width=586&height=294&_salt=1478732584.35...
[23:25:49] !log silence lutetium flapping check_mysql for two days
[23:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:30] I've mailed Jeff about it
[23:28:53] :) thanks
[23:38:37] !log update cassandra aggregation scheme for 'count' metrics - T121789
[23:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:43] T121789: Change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789
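The motivation behind T121789 is that Graphite's default aggregation, averaging, is the wrong operation for 'count' metrics when retention downsamples them. A tiny illustration with invented numbers:

    # When Graphite downsamples, e.g. six 10-second points into one
    # 60-second point, averaging a count reports the mean bucket rather
    # than the total. Invented data for illustration.
    points_10s = [5, 7, 0, 3, 9, 6]            # events counted per 10s bucket
    print(sum(points_10s) / len(points_10s))   # 5.0  -> misleading for a count
    print(sum(points_10s))                     # 30   -> the true 60s total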
[23:42:50] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2784282 (10GWicke) IRC meeting notes discussing this RFC: E355 Next steps: 1) look into which users rely on the current thumb format 2) investigate e...
[23:44:20] 06Operations, 06Performance-Team, 10Thumbor: Record OOM kills as a metric with mtail - https://phabricator.wikimedia.org/T148962#2784295 (10fgiunchedi)
[23:44:23] 06Operations, 06Performance-Team, 10Thumbor: Investigate why oom_kill mtail program doesn't work properly - https://phabricator.wikimedia.org/T149980#2784293 (10fgiunchedi) 05Open>03Resolved I've commented some of the more fine-grained EDAC regexps in https://gerrit.wikimedia.org/r/#/c/320631/ until the...
[23:48:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:48:33] (03PS1) 10Filippo Giunchedi: prometheus: add memcached_exporter [puppet] - 10https://gerrit.wikimedia.org/r/320702 (https://phabricator.wikimedia.org/T147326)
[23:48:41] (03PS2) 10Dzahn: (WIP) create eggdrop module [puppet] - 10https://gerrit.wikimedia.org/r/320698
[23:49:49] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2784337 (10GWicke)