[00:00:03] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[00:05:26] (03CR) 10Aaron Schulz: [C: 032] rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar)
[00:07:48] (03CR) 10Krinkle: Use EtcdConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling)
[00:10:21] (03PS5) 10Tim Starling: Use EtcdConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924)
[00:10:25] (03Merged) 10jenkins-bot: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar)
[00:10:33] (03CR) 10jenkins-bot: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar)
[00:14:04] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:14:24] RECOVERY - Check Varnish expiry mailbox lag on cp2014 is OK: OK: expiry mailbox lag is 0
[00:14:53] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:16:40] 06Operations, 10hardware-requests, 15User-fgiunchedi: Additional ram quote for Prometheus baremetal - https://phabricator.wikimedia.org/T161606#3196238 (10faidon)
[00:18:55] (03CR) 10Tim Starling: Use EtcdConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling)
[00:23:50] 06Operations: appserver fatals - intermittent failed connections to rdb2005 - https://phabricator.wikimedia.org/T163405#3196264 (10Dzahn)
[00:28:10] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6174/ (failure on tin is not related, still shows how it works on baham and doesnt affect others)" [puppet] - 10https://gerrit.wikimedia.org/r/348976 (https://phabricator.wikimedia.org/T163220) (owner: 10Dzahn)
[00:28:13] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:30:06] (03PS3) 10Dzahn: base: add icinga check for CPU frequency on Dell R320 [puppet] - 10https://gerrit.wikimedia.org/r/348976 (https://phabricator.wikimedia.org/T163220)
[00:32:22] (03PS4) 10Dzahn: base: add icinga check for CPU frequency on Dell R320 [puppet] - 10https://gerrit.wikimedia.org/r/348976 (https://phabricator.wikimedia.org/T163220)
[00:35:33] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0
[00:37:03] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:38:03] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[00:44:13] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_cpufreq]
[00:45:03] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_cpufreq]
[00:45:19] (03PS1) 10Dzahn: base/icinga: fix path to check_cpufreq plugin [puppet] - 10https://gerrit.wikimedia.org/r/349142
[00:45:52] (03PS2) 10Dzahn: base/icinga: fix path to check_cpufreq plugin [puppet] - 10https://gerrit.wikimedia.org/r/349142
[00:46:03] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_cpufreq]
[00:47:02] (03CR) 10Dzahn: [C: 032] base/icinga: fix path to check_cpufreq plugin [puppet] - 10https://gerrit.wikimedia.org/r/349142 (owner: 10Dzahn)
[00:49:24] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_cpufreq]
[00:50:54] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_cpufreq]
[00:54:03] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[00:55:11] (03PS5) 10Dzahn: dnsrec/icinga: add child/parent rel between monitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/347984
[00:55:17] TimStarling: interesting, I did figure it was some random filename preserved for back-compat, but didn't know it was *that* old, or that it was once a secret key
[00:55:18] https://github.com/wikimedia/mediawiki/blob/d82c14fb4fbac288b42ca5918b0a72f33ecb1e69/includes/DefaultSettings.php#L47-L48
[00:55:21] first svn commit
[00:56:12] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[00:57:27] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3196333 (10faidon) This just got in: > The fix for PR 1238906 has implemented in the main Junos release and would be available 14.1X53-D43 onwards. This releas...
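The "rpc: raise exception instead of die" change merged above swaps hard `die()` calls in rpc/RunJobs.php for thrown exceptions. As a rough PHP sketch of that refactor pattern (the function and message here are illustrative, not the actual mediawiki-config code), assuming a guard that rejects non-local callers:

```
<?php
// Illustrative sketch of the die() -> exception pattern; not the real
// rpc/RunJobs.php code. Before: the guard kills the request with an
// opaque message and no backtrace or structured log entry.
function assertJobrunnerRequestOld( bool $isLocal ): void {
	if ( !$isLocal ) {
		die( "Access denied\n" );
	}
}

// After: throw instead, so the failure goes through the normal error
// pipeline with a stack trace. The trade-off, seen later in this log,
// is that an uncaught exception turns the response into an HTTP 500,
// which the Icinga probe of /rpc/RunJobs.php treats as a failure.
function assertJobrunnerRequestNew( bool $isLocal ): void {
	if ( !$isLocal ) {
		throw new RuntimeException( 'Access denied to RunJobs.php' );
	}
}
```

As the 09:27-09:36 entries below show, exactly that 500-on-probe side effect is why the change ends up reverted later in this log.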
[00:58:52] and people say $wgLegacyEncoding is old
[00:59:19] got a meeting now
[01:09:24] actually it was cancelled
[01:09:46] 06Operations, 07Performance: icinga/grafana: webpagetest-alerts is alerting: Desktop Internet Explorer render issues - https://phabricator.wikimedia.org/T163408#3196344 (10Dzahn)
[01:10:48] I remember someone asking Lee on IRC about the debug log filename, he said that the random letters were meant to stop people from guessing the name and downloading the log
[01:11:23] this is a very pragmatic kind of security in which attackers don't bother to read your source files
[01:11:50] so I assume the lock file was chosen for the same reason, they are on adjacent lines in that initial commit
[01:12:12] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[01:12:13] 06Operations, 07Performance: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3196359 (10Dzahn)
[01:13:36] Krinkle: that's not quite the earliest version, actually
[01:13:48] 06Operations, 07Performance: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3196344 (10Dzahn)
[01:13:57] the initial work on MW was done in /trunk/phpwiki/newcodebase
[01:14:02] where we have:
[01:14:02] $wgReadOnlyFile = "/usr/local/apache/htdocs/upload/dblockflag838942";
[01:14:27] this is from 2002
[01:14:55] no debug log back then, only $wgDebugComments
[01:16:01] 06Operations, 10Monitoring, 13Patch-For-Review: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220#3196363 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=CPU+Freq
[01:16:23] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[01:19:53] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[01:28:03] !log ran puppet on all (16) Dell R320 via cumin to add CPU frequency check
[01:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:05] 06Operations, 10Monitoring, 07Performance: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3196368 (10Dzahn)
[01:35:55] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T163346
[01:37:13] !log mw2150 - restarted hhvm (had 'thread leakage' alert)
[01:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:13] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3196376 (10Dzahn)
[01:45:15] 06Operations, 10Monitoring, 13Patch-For-Review: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220#3196374 (10Dzahn) 05Open>03Resolved {F7653575}
[01:51:23] (03CR) 10Dzahn: [C: 032] dnsrec/icinga: add child/parent rel between monitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/347984 (owner: 10Dzahn)
[02:00:12] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=2533 [critical =2000]
[02:02:45] bblack: remember when you reinstalled dnsrecursor and you had downtimed everything, but then we still got the other "host down" alerts for those virtual IP "hosts"? ^ that should fix this issue. we taught Icinga that the real hosts are the "parents" of the virtual hosts, and if the parent is down Icinga would call the child "UNREACHABLE" but not "DOWN", because it knows why. and that should then mean the
[02:02:51] bot doesn't spam us, because it's a different status that it is not configured to output
[02:03:24] (that's for later, not realtime)
[02:50:58] 06Operations, 10Monitoring: Check for an oversized exim4 queue indicating mail delivery failures - https://phabricator.wikimedia.org/T133110#3196473 (10Dzahn) a:03Dzahn
[02:51:54] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3196474 (10Dzahn) a:05akosiaris>03Dzahn
[02:56:27] TimStarling: Yeah, I'm aware of phase2 and the php script. But still, the earliest phase3 svn commit is still pretty early
[03:48:27] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3196489 (10Dzahn) relevant puppet code: `modules/monitoring/manifests/service.pp` ``` 39 # If a service is set to critical and 40 # paging is not disabled for this machine in...
[04:21:44] (03CR) 10Krinkle: [C: 04-1] "-1 for lack of caching of the http requests. But we should also see if it's viable to have these live in the same cache key as was intende" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling)
[04:38:59] 06Operations, 10Monitoring, 06Performance-Team: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3196510 (10Krinkle)
[04:39:27] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3196511 (10Dzahn) my suggestion would be: - (optional) rename group "sms" to "core-ops" (or maybe "core-ops-sms") since it specifies a list of people, not a notification method, or at th...
[04:40:45] (03PS6) 10Tim Starling: Use EtcdConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924)
[04:52:33] 06Operations, 10Monitoring, 06Performance-Team: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3196344 (10Krinkle) WebPageTest alerts:
(03PS1) 10Catrope: Set ORES thresholds in new format for all enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760)
[04:53:57] (03CR) 10jerkins-bot: [V: 04-1] Set ORES thresholds in new format for all enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760) (owner: 10Catrope)
[05:14:26] PROBLEM - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[05:15:26] PROBLEM - cassandra-a service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[05:15:36] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
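Krinkle's -1 at 04:21 on the EtcdConfig change objects to uncached HTTP requests against etcd. A minimal sketch of the kind of local caching the review asks for, assuming APCu is available; `fetchFromEtcd()`, the endpoint URL, and the TTL are all hypothetical illustrations, not the real EtcdConfig code:

```
<?php
// Hypothetical sketch: serve config from a local APCu cache and only
// hit etcd over HTTP when the cached copy has expired.
function getConfigValue( string $key, int $cacheTtl = 10 ) {
	$cacheKey = "etcdconfig:$key";
	$cached = apcu_fetch( $cacheKey, $found );
	if ( $found ) {
		return $cached; // fast path: no HTTP round-trip per request
	}
	$value = fetchFromEtcd( $key ); // the expensive HTTP request
	apcu_store( $cacheKey, $value, $cacheTtl );
	return $value;
}

// Placeholder for the real HTTP fetch against an etcd cluster; the
// host name and v2 keys API path here are purely illustrative.
function fetchFromEtcd( string $key ) {
	$json = file_get_contents( "http://etcd.example.org:2379/v2/keys/$key" );
	return json_decode( $json, true );
}
```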
[05:15:36] PROBLEM - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.32 and port 9042: Connection refused
[05:21:24] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3196527 (10tstarling)
[05:27:52] (03PS2) 10Catrope: Set ORES thresholds in new format for all enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349146 (https://phabricator.wikimedia.org/T162760)
[05:37:26] RECOVERY - cassandra-a service on restbase1016 is OK: OK - cassandra-a is active
[05:37:36] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational
[05:38:26] RECOVERY - cassandra-a SSL 10.64.0.32:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-a valid until 2017-12-13 00:15:49 +0000 (expires in 236 days)
[05:38:36] RECOVERY - cassandra-a CQL 10.64.0.32:9042 on restbase1016 is OK: TCP OK - 0.000 second response time on 10.64.0.32 port 9042
[05:59:27] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3196540 (10Marostegui) db2062 and db2069 are in the same state as yesterday (but with a lot less IOPS than after the initial peak) so that is...
[06:02:33] (03Abandoned) 10Marostegui: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348970 (https://phabricator.wikimedia.org/T163351) (owner: 10Marostegui)
[06:04:06] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=445.10 Read Requests/Sec=601.60 Write Requests/Sec=3.60 KBytes Read/Sec=38016.00 KBytes_Written/Sec=82.40
[06:13:06] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.10 Read Requests/Sec=0.00 Write Requests/Sec=0.50 KBytes Read/Sec=0.00 KBytes_Written/Sec=3.60
[06:18:11] 06Operations, 10DBA, 05codfw-rollout: Pool db2071? - https://phabricator.wikimedia.org/T163413#3196542 (10Marostegui)
[06:18:20] 06Operations, 10DBA, 05codfw-rollout: Pool new server db2071? - https://phabricator.wikimedia.org/T163413#3196556 (10Marostegui)
[06:30:28] 06Operations, 10Traffic, 13Patch-For-Review: Huge increase in cache_upload 404s due to buggy client-side code from graphiq.com - https://phabricator.wikimedia.org/T151444#3196558 (10ema) 05Open>03Resolved a:03ema 404 rate [[https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-code...
[06:31:09] (03Abandoned) 10Ema: cache_upload: stop graphiq.com buggy javascript [puppet] - 10https://gerrit.wikimedia.org/r/323135 (https://phabricator.wikimedia.org/T151444) (owner: 10Ema)
[06:34:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:37:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:39:58] !log Deploy alter table on s4.oldimage on eqiad master db1040 (this will create lag on eqiad - all hosts have been silenced) - T73563
[06:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:08] T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563
[06:52:40] (03CR) 10Marostegui: "Do we want to deploy this in the end?" [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui)
[07:02:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:03:47] 06Operations, 10DBA, 05codfw-rollout: Pool new server db2071 - https://phabricator.wikimedia.org/T163413#3196599 (10jcrespo)
[07:03:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:04:13] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3196601 (10jcrespo)
[07:04:15] 06Operations, 10DBA, 05codfw-rollout: Pool new server db2071 - https://phabricator.wikimedia.org/T163413#3196542 (10jcrespo)
[07:04:46] (03PS1) 10Jcrespo: Setup db2071 as new s1 slave [puppet] - 10https://gerrit.wikimedia.org/r/349158 (https://phabricator.wikimedia.org/T163413)
[07:06:17] (03CR) 10Marostegui: [C: 031] Setup db2071 as new s1 slave [puppet] - 10https://gerrit.wikimedia.org/r/349158 (https://phabricator.wikimedia.org/T163413) (owner: 10Jcrespo)
[07:07:17] (03CR) 10Jcrespo: [C: 032] Setup db2071 as new s1 slave [puppet] - 10https://gerrit.wikimedia.org/r/349158 (https://phabricator.wikimedia.org/T163413) (owner: 10Jcrespo)
[07:12:04] !log Deploy alter table on s4.image on eqiad master db1040 (this will create lag on eqiad - all hosts have been silenced) - https://phabricator.wikimedia.org/T73563
[07:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349160 (https://phabricator.wikimedia.org/T132416)
[07:27:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349160 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[07:28:57] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349160 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[07:29:05] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349160 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui)
[07:31:54] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Depool db1065 - T132416 (duration: 02m 18s)
[07:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:03] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:36:31] (03PS1) 10Jcrespo: Depool db1083 for cloning and pre-setup new server db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349162 (https://phabricator.wikimedia.org/T163413)
[07:37:47] (03PS2) 10Jcrespo: Depool db1080 for cloning and pre-setup new server db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349162 (https://phabricator.wikimedia.org/T163413)
[07:38:14] 06Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3196640 (10elukey) @Papaul: we replaced the one of the RAM banks on mw2256 a while ago, it might be possible that we are more similar issues?
[07:39:11] (03CR) 10Marostegui: [C: 031] Depool db1080 for cloning and pre-setup new server db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349162 (https://phabricator.wikimedia.org/T163413) (owner: 10Jcrespo)
[07:44:20] (03CR) 10Jcrespo: [C: 032] Depool db1080 for cloning and pre-setup new server db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349162 (https://phabricator.wikimedia.org/T163413) (owner: 10Jcrespo)
[07:44:26] (03PS1) 10Alexandros Kosiaris: puppetmaster: Depool puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/349164 (https://phabricator.wikimedia.org/T148506)
[07:46:21] (03Merged) 10jenkins-bot: Depool db1080 for cloning and pre-setup new server db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349162 (https://phabricator.wikimedia.org/T163413) (owner: 10Jcrespo)
[07:46:29] (03CR) 10jenkins-bot: Depool db1080 for cloning and pre-setup new server db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349162 (https://phabricator.wikimedia.org/T163413) (owner: 10Jcrespo)
[07:52:26] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3196658 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff I've upgraded mwdebug to also use 3.18.2. terbium will be reimaged to jessie next week (initially it will use HHVM 3.12 and it'll b...
[07:53:04] !log Deploy alter table enwiki.revision db1065 - https://phabricator.wikimedia.org/T132416
[07:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:23] !log jynus@naos Synchronized wmf-config/db-eqiad.php: Depool db1080 (duration: 01m 02s)
[07:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:56] did you depool db1067?
[07:54:57] !log jynus@naos Synchronized wmf-config/db-codfw.php: Add db2071, depooled (duration: 00m 53s)
[07:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:19] jynus: i depooled db1065
[07:55:36] I do not see any of our changes on noc
[07:55:40] and pooled db1067 for vslow, dump
[07:55:42] mmm
[07:55:51] 06Operations, 10ops-eqiad, 10netops: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3196663 (10elukey)
[07:56:21] I do see it
[07:56:25] maybe you have it cached?
[07:56:40] anything can be
[07:56:42] 06Operations, 10ops-eqiad, 10netops: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3183025 (10elukey) >>! In T163002#3193747, @ayounsi wrote: >>if possible to migrate kafka1022 > I believe you mean kafka1020 Definitely, fixed the task's description...
[07:56:44] https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php&hello
[07:57:25] I see it now, the browser had cached it
[07:57:30] :)
[07:59:13] !log shutting down db1080 for cloning and upgrade T163413
[07:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:20] T163413: Pool new server db2071 - https://phabricator.wikimedia.org/T163413
[08:02:23] (03PS1) 10Gehel: logstash - raise elasticsearch shard alert threshold to 34 [puppet] - 10https://gerrit.wikimedia.org/r/349168
[08:06:54] (03CR) 10Hashar: [C: 031] Jenkins: install jdk, not just jre (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/348961 (owner: 10Chad)
[08:13:18] (03PS1) 10Alexandros Kosiaris: Remove neon hieradata, host does not exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/349169
[08:13:20] (03PS1) 10Alexandros Kosiaris: Refactor role::icinga slightly [puppet] - 10https://gerrit.wikimedia.org/r/349170
[08:13:22] (03PS1) 10Alexandros Kosiaris: Switch einsteinium and tegmen roles [puppet] - 10https://gerrit.wikimedia.org/r/349171 (https://phabricator.wikimedia.org/T163323)
[08:20:51] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: Pool new server db2071 - https://phabricator.wikimedia.org/T163413#3196712 (10Marostegui) db2072 is also now ready to be used if needed. All went fine after rebooting it.
[08:21:30] (03CR) 10Alexandros Kosiaris: [C: 032] Remove neon hieradata, host does not exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/349169 (owner: 10Alexandros Kosiaris)
[08:22:27] (03CR) 10DCausse: logstash - raise elasticsearch shard alert threshold to 34 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349168 (owner: 10Gehel)
[08:23:34] (03CR) 10Gehel: logstash - raise elasticsearch shard alert threshold to 34 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349168 (owner: 10Gehel)
[08:25:21] (03PS2) 10Alexandros Kosiaris: Refactor role::icinga slightly [puppet] - 10https://gerrit.wikimedia.org/r/349170
[08:25:23] (03PS2) 10Alexandros Kosiaris: Switch einsteinium and tegmen roles [puppet] - 10https://gerrit.wikimedia.org/r/349171 (https://phabricator.wikimedia.org/T163323)
[08:30:58] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Add new node db2071 [puppet] - 10https://gerrit.wikimedia.org/r/349173 (https://phabricator.wikimedia.org/T163413)
[08:31:14] 06Operations, 06DC-Ops, 10Traffic, 10netops, 13Patch-For-Review: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3193493 (10akosiaris) Wrong patch above, please ignore.
[08:32:01] (03PS3) 10Alexandros Kosiaris: Refactor role::icinga slightly [puppet] - 10https://gerrit.wikimedia.org/r/349170
[08:32:03] (03PS3) 10Alexandros Kosiaris: Switch einsteinium and tegmen roles [puppet] - 10https://gerrit.wikimedia.org/r/349171 (https://phabricator.wikimedia.org/T163324)
[08:32:55] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Add new node db2071 [puppet] - 10https://gerrit.wikimedia.org/r/349173 (https://phabricator.wikimedia.org/T163413)
[08:33:36] (03PS2) 10Gehel: logstash - raise elasticsearch shard alert threshold to 34 [puppet] - 10https://gerrit.wikimedia.org/r/349168
[08:34:28] (03PS4) 10Alexandros Kosiaris: Refactor role::icinga slightly [puppet] - 10https://gerrit.wikimedia.org/r/349170
[08:34:31] (03PS4) 10Alexandros Kosiaris: Switch einsteinium and tegmen roles [puppet] - 10https://gerrit.wikimedia.org/r/349171 (https://phabricator.wikimedia.org/T163324)
[08:38:45] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3196736 (10Joe)
[08:39:00] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3196751 (10Joe) p:05Triage>03Unbreak!
[08:41:38] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3196768 (10Joe)
[08:46:28] (03CR) 10Gehel: logstash - raise elasticsearch shard alert threshold to 34 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349168 (owner: 10Gehel)
[08:47:38] <_joe_> !log live-patching ./includes/jobqueue/jobs/RefreshLinksJob.php to drop all recursive jobs, T163418
[08:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:46] T163418: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418
[08:47:49] (03PS3) 10Gehel: logstash - raise elasticsearch shard alert threshold to 34 [puppet] - 10https://gerrit.wikimedia.org/r/349168
[08:50:10] (03CR) 10DCausse: logstash - raise elasticsearch shard alert threshold to 34 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349168 (owner: 10Gehel)
[08:51:08] (03PS1) 10Marostegui: s1.hosts: Add db2071 [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413)
[08:51:59] (03CR) 10jerkins-bot: [V: 04-1] s1.hosts: Add db2071 [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[08:52:31] uh=
[08:52:33] ?
[08:53:11] PROBLEM - HHVM jobrunner on mw2153 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1703 bytes in 0.074 second response time
[08:55:28] ^looking
[08:55:31] (03PS1) 10Alexandros Kosiaris: puppet_compiler: Use dirname in the conftool namespace [puppet] - 10https://gerrit.wikimedia.org/r/349176
[08:56:26] (03PS2) 10Alexandros Kosiaris: puppet_compiler: Use dirname in the conftool namespace [puppet] - 10https://gerrit.wikimedia.org/r/349176
[08:56:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppet_compiler: Use dirname in the conftool namespace [puppet] - 10https://gerrit.wikimedia.org/r/349176 (owner: 10Alexandros Kosiaris)
[09:00:06] (03CR) 10Marostegui: [C: 031] prometheus-mysqld-exporter: Add new node db2071 [puppet] - 10https://gerrit.wikimedia.org/r/349173 (https://phabricator.wikimedia.org/T163413) (owner: 10Jcrespo)
[09:09:23] (03PS5) 10Alexandros Kosiaris: Refactor role::icinga slightly [puppet] - 10https://gerrit.wikimedia.org/r/349170
[09:09:25] (03PS5) 10Alexandros Kosiaris: Switch einsteinium and tegmen roles [puppet] - 10https://gerrit.wikimedia.org/r/349171 (https://phabricator.wikimedia.org/T163324)
[09:09:43] (03CR) 10Volans: Fix configuration of size limits to allow paged LDAP search requests [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff)
[09:09:53] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: Pool new server db2071 - https://phabricator.wikimedia.org/T163413#3196888 (10Marostegui) More servers are now online and ready to be used if needed, I will post it on the original tracking task (T162159) so I don't hijack this one.
[09:12:47] (03CR) 10Jcrespo: [C: 031] s1.hosts: Add db2071 [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[09:13:31] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3196923 (10Marostegui) @Papaul the following servers were ready to get puppet enabled and all that, so I did so, and rebooted them db2071 db2072 db2073 db2075 db2076 db2079 db2...
[09:14:30] <_joe_> !log scap pull of live hack for T163418 on mw2154
[09:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:38] T163418: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418
[09:15:40] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 13Patch-For-Review: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3196927 (10Joe) I'm testing this live hack: ``` diff --git a/includes/jobqueue/jobs/RefreshLinksJob.php b/includes/jo...
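The diff quoted in Joe's Phabricator comment above is cut off mid-line. A rough PHP sketch of the idea behind the live hack — skipping "recursive" RefreshLinks root jobs so the duplicated backlog drains instead of fanning out further — follows; the guard is a reconstruction of the intent, not the actual deployed patch:

```
<?php
// Sketch of the live hack's intent, NOT the deployed diff: a recursive
// RefreshLinks job (a root job that re-enqueues per-page children) is
// dropped instead of executed. Returning true makes the queue treat the
// job as done, so it is acknowledged and discarded rather than retried.
function runRefreshLinksJob( array $params ): bool {
	if ( !empty( $params['recursive'] ) ) {
		return true; // drop the recursive job entirely
	}
	// ...the normal single-page links refresh would happen here...
	return true;
}
```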
[09:16:52] PROBLEM - HHVM jobrunner on mw2154 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1703 bytes in 0.078 second response time
[09:17:57] <_joe_> !log removed the live hack, running scap pull again on mw2154
[09:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:49] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[09:26:24] (03CR) 10Alexandros Kosiaris: [C: 032] Refactor role::icinga slightly [puppet] - 10https://gerrit.wikimedia.org/r/349170 (owner: 10Alexandros Kosiaris)
[09:26:38] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/6179/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/349170 (owner: 10Alexandros Kosiaris)
[09:26:42] (03PS6) 10Alexandros Kosiaris: Refactor role::icinga slightly [puppet] - 10https://gerrit.wikimedia.org/r/349170
[09:26:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Refactor role::icinga slightly [puppet] - 10https://gerrit.wikimedia.org/r/349170 (owner: 10Alexandros Kosiaris)
[09:26:52] (03CR) 10jerkins-bot: [V: 04-1] s1.hosts: Add db2071 [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[09:26:54] (03PS4) 10Gehel: logstash - raise elasticsearch shard alert threshold to 34 [puppet] - 10https://gerrit.wikimedia.org/r/349168
[09:27:17] (03PS1) 10Hashar: Revert "rpc: raise exception instead of die" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349181
[09:27:25] (03PS2) 10Hashar: Revert "rpc: raise exception instead of die" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349181
[09:27:33] (03CR) 10Hashar: [C: 032] Revert "rpc: raise exception instead of die" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349181 (owner: 10Hashar)
[09:31:37] (03Merged) 10jenkins-bot: Revert "rpc: raise exception instead of die" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349181 (owner: 10Hashar)
[09:31:48] (03CR) 10jenkins-bot: Revert "rpc: raise exception instead of die" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349181 (owner: 10Hashar)
[09:34:45] (03PS1) 10Alexandros Kosiaris: Switchover icinga.wikimedia.org to tegmen [dns] - 10https://gerrit.wikimedia.org/r/349184 (https://phabricator.wikimedia.org/T163324)
[09:34:47] (03PS1) 10Hashar: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349185
[09:34:54] !log hashar@naos Synchronized rpc/RunJobs.php: Revert "rpc: raise exception instead of die" - causes monitoring spam (duration: 01m 20s)
[09:34:57] (03CR) 10Hashar: "Reverted and sent again as https://gerrit.wikimedia.org/r/349185" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328182 (owner: 10Hashar)
[09:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:11] RECOVERY - HHVM jobrunner on mw2153 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.073 second response time
[09:35:20] <_joe_> ^^ see :)
[09:35:28] <_joe_> ok
[09:35:37] <_joe_> I'm gonna do my live-hack now
[09:35:51] RECOVERY - HHVM jobrunner on mw2154 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.073 second response time
[09:36:51] (03CR) 10Hashar: [C: 04-2] "Icinga does check /rpc/RunJobs.php from the Icinga host (not via NRPE). Hence the probe requests ends up with a 500 caused by the exceptio" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349185 (owner: 10Hashar)
[09:38:03] <_joe_> !log live-hack redeployed, running scap pull on codfw jobrunners
[09:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:18] <_joe_> !log live-hack redeployed, running scap pull on codfw jobrunners T163418
[09:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:26] T163418: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418
[09:41:47] (03PS1) 10Alexandros Kosiaris: tcpircbot: Subscribe to the correct File resource [puppet] - 10https://gerrit.wikimedia.org/r/349186
[09:42:55] (03PS2) 10Elukey: Set Xms value for the Hadoop Yarn Resource Manager's JVM [puppet] - 10https://gerrit.wikimedia.org/r/348915 (https://phabricator.wikimedia.org/T159219)
[09:44:03] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3197012 (10ema) The vast majority of objects ending up in bin0 (<16K) are actually very small: most are < 1.5K. Thus, they might be causing contention/fragmentat...
[09:45:25] (03CR) 10Alexandros Kosiaris: [C: 032] Switch einsteinium and tegmen roles [puppet] - 10https://gerrit.wikimedia.org/r/349171 (https://phabricator.wikimedia.org/T163324) (owner: 10Alexandros Kosiaris)
[09:45:30] (03PS6) 10Alexandros Kosiaris: Switch einsteinium and tegmen roles [puppet] - 10https://gerrit.wikimedia.org/r/349171 (https://phabricator.wikimedia.org/T163324)
[09:45:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Switch einsteinium and tegmen roles [puppet] - 10https://gerrit.wikimedia.org/r/349171 (https://phabricator.wikimedia.org/T163324) (owner: 10Alexandros Kosiaris)
[09:47:02] (03CR) 10Volans: [C: 031] "The change look sane, I'd try to find more time later to look at the whole project." [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/342444 (owner: 10Giuseppe Lavagetto)
[09:47:43] !log running the cleanup script for ores_classification in enwiki
[09:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:55] (03CR) 10Volans: [C: 031] "But please ensure that jenkins runs the tests" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/342444 (owner: 10Giuseppe Lavagetto)
[09:51:26] The lag is strangely big even though I do "\MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->waitForReplication();" in each batch and wait for an additional 15 seconds just to be sure
[09:51:30] (03CR) 10Volans: [C: 031] "LGMT, let jenkins run the test" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/342445 (owner: 10Giuseppe Lavagetto)
[09:51:32] maybe the batch is too big
[09:51:48] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 13Patch-For-Review, 05codfw-rollout: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3197025 (10Aklapper)
[09:51:50] (03PS3) 10Ema: Revert "cache_upload: override CT updates on 304s" [puppet] - 10https://gerrit.wikimedia.org/r/348699 (https://phabricator.wikimedia.org/T162035)
[09:51:51] (03PS1) 10Ema: cache_upload: don't cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661)
[09:53:32] (03PS3) 10Jcrespo: prometheus-mysqld-exporter: Add new node db2071 [puppet] - 10https://gerrit.wikimedia.org/r/349173 (https://phabricator.wikimedia.org/T163413)
[09:53:39] (03CR) 10Elukey: [C: 032] Set Xms value for the Hadoop Yarn Resource Manager's JVM [puppet] - 10https://gerrit.wikimedia.org/r/348915 (https://phabricator.wikimedia.org/T159219) (owner: 10Elukey)
[09:53:44] (03CR) 10Jcrespo: [V: 032 C: 032] prometheus-mysqld-exporter: Add new node db2071 [puppet] - 10https://gerrit.wikimedia.org/r/349173 (https://phabricator.wikimedia.org/T163413) (owner: 10Jcrespo)
[09:53:45] (03PS3) 10Elukey: Set Xms value for the Hadoop Yarn Resource Manager's JVM [puppet] - 10https://gerrit.wikimedia.org/r/348915 (https://phabricator.wikimedia.org/T159219)
[09:53:48] (03CR) 10Elukey: [V: 032 C: 032] Set Xms value for the Hadoop Yarn Resource Manager's JVM [puppet] - 10https://gerrit.wikimedia.org/r/348915 (https://phabricator.wikimedia.org/T159219) (owner: 10Elukey)
[09:54:00] (03PS4) 10Elukey: Set Xms value for the Hadoop Yarn Resource Manager's JVM [puppet] - 10https://gerrit.wikimedia.org/r/348915 (https://phabricator.wikimedia.org/T159219)
[09:54:06] (03CR) 10Elukey: [V: 032 C: 032] Set Xms value for the Hadoop Yarn Resource Manager's JVM [puppet] - 10https://gerrit.wikimedia.org/r/348915 (https://phabricator.wikimedia.org/T159219) (owner: 10Elukey)
[09:54:16] PROBLEM - HTTPS on tegmen is CRITICAL: SSL CRITICAL - Certificate icinga.wikimedia.org expired
[09:54:26] PROBLEM - HTTPS-tendril on tegmen is CRITICAL: SSL CRITICAL - Certificate tendril.wikimedia.org expired
[09:55:00] (03CR) 10Alexandros Kosiaris: [C: 032] Switchover icinga.wikimedia.org to tegmen [dns] - 10https://gerrit.wikimedia.org/r/349184 (https://phabricator.wikimedia.org/T163324) (owner: 10Alexandros Kosiaris)
[09:55:23] these ^ are expected
[09:55:29] switching the DNS around now
[09:55:38] (03PS1) 10Hashar: salt-misc: pin pylint <1.7.0 [software] - 10https://gerrit.wikimedia.org/r/349189
[09:55:51] the lag is normal now
[09:55:57] jynus: can I merge your change?
[09:56:07] elukey, I was doing that
[09:56:10] (now I use 2K batches instead of 10K)
[09:56:11] okok
[09:56:12] :)
[09:56:19] I am ok to merge
[09:56:26] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-icinga],Exec[acme-setup-acme-tendril]
[09:56:30] (03CR) 10Volans: "I don't see why not marostegui, the current CNAMEs are just wrong. Probably x1-slave needs to be decided if it's ok or not like this." [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui)
[09:57:37] (03CR) 10Marostegui: "> I don't see why not marostegui, the current CNAMEs are just wrong." [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui)
[09:58:01] (03CR) 10Marostegui: [C: 031] salt-misc: pin pylint <1.7.0 [software] - 10https://gerrit.wikimedia.org/r/349189 (owner: 10Hashar)
[10:01:53] (03PS1) 10Alexandros Kosiaris: einsteinium: Set do_acme: false [puppet] - 10https://gerrit.wikimedia.org/r/349191
[10:03:17] (03PS2) 10Alexandros Kosiaris: einsteinium: Set do_acme: false [puppet] - 10https://gerrit.wikimedia.org/r/349191
[10:03:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] einsteinium: Set do_acme: false [puppet] - 10https://gerrit.wikimedia.org/r/349191 (owner: 10Alexandros Kosiaris)
[10:07:40] !log restart Yarn Resource manager on analytics1002 (hadoop master standby) to pick up new JVM settings
[10:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:03] (03PS1) 10Muehlenhoff: Load nf_conntrack via /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/349193 (https://phabricator.wikimedia.org/T136094)
[10:16:12] <_joe_> Amir1: can you report on the bug how many duplicates you find
[10:16:31] <_joe_> and the times of the edits those duplicates refer to, if possible?
[10:17:00] _joe_: when I run it without limit it times out so I run it on batches of 2K
[10:17:22] <_joe_> Amir1: ok, do you see which edits have duplicates?
[10:17:24] They vary in all types of times
[10:17:28] <_joe_> in the logs I mean
[10:17:36] <_joe_> uh?
[10:17:41] <_joe_> that's strange
[10:17:55] <_joe_> the edits, I mean, not the duplicate runs
[10:18:22] Yeah, I think we have another source of duplication too
[10:18:29] my guess goes to api.php functionality
[10:18:35] which is disabled for now
[10:18:51] but it was up for some time
[10:20:18] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 22 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:21:08] _joe_: because, out of 104M rows in ores_classification almost 80M of them are because of that functionality and it was heavily under pressure because of that bot
[10:21:40] <_joe_> Amir1: oh I see
[10:22:14] <_joe_> Amir1: then Krinkle's theory that the current duplicates were just unacknowledged jobs might be correct
[10:22:54] <_joe_> and that, luckily, is easy to solve on my part
[10:23:18] awesome, as another check we should get the schema change deployed too
[10:24:29] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 13Patch-For-Review, and 2 others: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3197126 (10Joe)
[10:26:42] (03CR) 10Muehlenhoff: [C: 031] "Thanks, looks good. We can give it a live test by rebooting logstash1004 again." [puppet] - 10https://gerrit.wikimedia.org/r/349168 (owner: 10Gehel)
[10:28:16] (03PS1) 10Alexandros Kosiaris: role::icinga: Also allow IPv6 from rsync [puppet] - 10https://gerrit.wikimedia.org/r/349195
[10:28:18] (03PS1) 10Alexandros Kosiaris: network::constants: Fix einsteinium's IPv6 IP [puppet] - 10https://gerrit.wikimedia.org/r/349196
[10:28:30] (03CR) 10jerkins-bot: [V: 04-1] role::icinga: Also allow IPv6 from rsync [puppet] - 10https://gerrit.wikimedia.org/r/349195 (owner: 10Alexandros Kosiaris)
[10:28:37] (03CR) 10jerkins-bot: [V: 04-1] network::constants: Fix einsteinium's IPv6 IP [puppet] - 10https://gerrit.wikimedia.org/r/349196 (owner: 10Alexandros Kosiaris)
[10:29:53] (03PS2) 10Alexandros Kosiaris: role::icinga: Also allow IPv6 for icinga rsync [puppet] - 10https://gerrit.wikimedia.org/r/349195
[10:29:55] (03PS2) 10Alexandros Kosiaris: network::constants: Fix einsteinium's IPv6 IP [puppet] - 10https://gerrit.wikimedia.org/r/349196
[10:32:11] !log installing remaining dbus updates from jessie point update
[10:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:39] (03CR) 10Alexandros Kosiaris: [C: 032] role::icinga: Also allow IPv6 for icinga rsync [puppet] - 10https://gerrit.wikimedia.org/r/349195 (owner: 10Alexandros Kosiaris)
[10:32:56] (03CR) 10Alexandros Kosiaris: [C: 032] network::constants: Fix einsteinium's IPv6 IP [puppet] - 10https://gerrit.wikimedia.org/r/349196 (owner: 10Alexandros Kosiaris)
[10:36:56] volans: nsca daemon starting to spawn on einsteinium
[10:37:08] akosiaris: nice! let's join the party
[10:37:18] 430 already
[10:37:28] soo... what on earth is going on over there
[10:38:02] rt_sigaction(SIGCHLD, NULL, {0x402e50, [], SA_RESTORER|SA_NOCLDSTOP, 0x7fc30e8e20e0}, 8) = 0
[10:39:06] seems it's spawning one per second
[10:40:03] unrelated but I also just noticed this #011Max concurrent service checks (5000) has been reached.
[10:40:17] I guess we can increase the limit
[10:40:18] that was on tegmen too
[10:40:41] not sure if einsteinium was doing it before too, let's check the logs
[10:40:49] it was probably doing it
[10:41:01] as far as checks go they both do the exact same thing
[10:41:30] yes, we have 20k~30k Nudging in all the logs
[10:41:57] sometimes more
[10:42:08] sometimes less
[10:42:11] anyway, that's probably unrelated and easy to fix
[10:42:20] now we have 557 nsca processes
[10:42:37] all of these are children of init directly, so well daemonized
[10:43:35] Active: active (running) since Wed 2016-11-02 17:25:49 UTC; 5 months 16 days ago
[10:43:35] they are children of 1012, no?
[10:43:55] at least ps fax shows them as children of 1012
[10:44:13] my previous paste is from: sudo strace -fF -p 1012
[10:44:18] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3197165 (10Gilles)
[10:44:29] yeah indeed, me reading ps output wrong
[10:44:29] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3197177 (10Gilles) p:05Triage>03High
[10:44:41] so.. a reload gone wrong ?
[10:45:03] If I attach to a child
[10:45:07] it prints some stuff and exits
[10:45:27] the strace attach makes it exit...
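For reference, the batched cleanup Amir1 describes above at 09:47-09:56 (waitForReplication() between batches, later shrinking from 10K to 2K rows) follows a standard MediaWiki maintenance pattern. A sketch under those assumptions — the duplicate-row condition and batch sizing are illustrative, while `waitForReplication()` is the real service call quoted at 09:51:

```
<?php
// Sketch of batched deletion that bounds replica lag: delete a fixed
// number of rows, then block until replicas catch up before continuing.
use MediaWiki\MediaWikiServices;

function purgeOresClassificationDuplicates( int $batchSize = 2000 ): void {
	$services = MediaWikiServices::getInstance();
	$dbw = $services->getDBLoadBalancer()->getConnection( DB_MASTER );
	do {
		$ids = $dbw->selectFieldValues(
			'ores_classification', 'oresc_id',
			[ /* duplicate-row condition would go here */ ],
			__METHOD__,
			[ 'LIMIT' => $batchSize ]
		);
		if ( $ids ) {
			$dbw->delete( 'ores_classification',
				[ 'oresc_id' => $ids ], __METHOD__ );
			// Wait for all replicas before the next batch; smaller
			// batches keep the per-batch lag spike low.
			$services->getDBLoadBalancerFactory()->waitForReplication();
		}
	} while ( count( $ids ) === $batchSize );
}
```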
[10:45:40] let me paste into the task
[10:45:49] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3197165 (10Gilles) @faidon @fgiunchedi if someone can ack this alert for now, that'd be much appreciated: > Notification Type: PROBLEM > > Service: https://gr...
[10:45:56] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[10:46:24] command_file=/rw/nagios.cmd
[10:46:26] ???
[10:46:29] yes
[10:46:37] what am I missing ?
[10:47:19] 06Operations, 13Patch-For-Review, 15User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3197180 (10Joe)
[10:47:44] it's wrong in the private repo to start with
[10:47:56] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset 0.003782 secs
[10:47:57] why that entire file is in the private repo.. beats me
[10:49:04] 06Operations, 10Monitoring, 13Patch-For-Review: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3197185 (10Volans) So after the switch of `tegmen` as active now we have the issue on `einsteinium`: ``` 1012 ? Ss 48:55 /usr/sbin/nsca --daemon...
[10:49:48] akosiaris: in the puppet repo it seems to always be used with the absolute path: /var/lib/nagios/rw/nagios.cmd
[10:50:35] so it seems wrong only in the private repo
[10:50:43] and we have 2 copies of it
[10:50:46] grrr
[10:50:55] one under nagios/nsca.cfg and one under icinga/nsca.cfg
[10:50:59] and ofc they differ
[10:51:10] including that line
[10:51:21] great!
[10:51:49] the other is command_file=nagios.cmd
[10:51:54] even worse
[10:52:00] would icinga/nagios be smart and use the right prefix?
[10:52:14] maybe something to do with the chroot ?
[10:52:17] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 13Patch-For-Review, and 2 others: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3196736 (10Joe) FTR, the queue is dropping fast, as the number of processed jobs. I'll de-deploy my hack...
[10:52:32] on ours it is icinga.cfg:175:command_file=/var/lib/nagios/rw/nagios.cmd
[10:52:36] on the host
[10:52:43] nsca_chroot=/var/lib/nagios
[10:53:03] the funny thing is .. this might not be related at all
[10:53:42] so far it seems related :D
[10:54:01] to the processes endlessly spawning ?
[10:54:04] how ?
[10:54:17] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[10:54:24] so both files seem to exist from 2015 at the big commit d3870852084b7530b29897ecf15762a9aad605ac
[10:54:34] yeah the move to secret()
[10:54:49] so how did this happen only now?
[10:55:02] now as in the last month
[10:55:11] something else has changed.. not the config
[10:55:17] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset 0.000262 secs
[10:55:17] but what ?
[10:55:44] and why, if I attach strace, it actually unblocks it
[10:55:45] and exits
[10:56:02] so my theory so far
[10:56:09] the children are "stuck" so they don't return to the parent
[10:56:14] that keeps spawning children
[10:56:18] open("/rw/nagios.cmd", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
[10:56:22] until they will answer "ok"
[10:56:34] that's what I got from strace
[10:56:40] oh
[10:56:42] yes, me too, it's on the task
[10:56:43] you pasted it already
[10:56:47] hmm
[10:56:57] why they are "stuck", no idea so far
[10:57:06] hi, why isn't it possible to read the actual contents of de:Portal_Diskussion:Lebewesen by API?
[10:57:17] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[10:57:25] waiting for something forever?
[10:57:33] moritzm: FYI a couple of NTP alarms going off
[10:57:44] weird
[10:57:52] and what triggers that ...
[10:58:17] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset -0.000966 secs
[10:58:27] there are children since 2016
[10:58:34] (03PS1) 10ArielGlenn: updated for support up through MW 1.29 [dumps] - 10https://gerrit.wikimedia.org/r/349199
[10:58:40] (03CR) 10jerkins-bot: [V: 04-1] updated for support up through MW 1.29 [dumps] - 10https://gerrit.wikimedia.org/r/349199 (owner: 10ArielGlenn)
[10:58:46] so I guess this is normal behavior. spawning a child to process the results ?
[10:58:52] makes sense
[10:58:57] aaahhh
[10:59:02] damn I have an idea
[10:59:40] * akosiaris checking something
[10:59:57] PROBLEM - NTP peers on hydrogen is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[11:00:11] what's up with ntp ?
[11:00:15] (03Abandoned) 10ArielGlenn: updated for support up through MW 1.29 [dumps] - 10https://gerrit.wikimedia.org/r/349199 (owner: 10ArielGlenn)
[11:00:22] akosiaris: are we here? https://github.com/NagiosEnterprises/nsca/blob/nsca-2-9-1/src/nsca.c#L890
[11:01:04] volans: I think it's the restart of icinga that triggers this
[11:01:17] the one done by the cron script
[11:01:27] told ya :D
[11:01:34] but that is every 10 minutes
[11:01:51] which is very well timed with the processes' start times
[11:01:56] 10:30, 10:40, 10:50
[11:01:59] ;-)
[11:02:14] ntp is strange, e.g. hydrogen lost the connection to nescio ATM
[11:02:22] so, IIRC we ship our own init script for icinga
[11:02:25] stratum 16, i.e. a dead connection
[11:02:40] akosiaris: then I re-propose my proposal in https://phabricator.wikimedia.org/T163286#3192354 :D
[11:02:48] for reasons I do not know, nor comprehend, but it's probably a mistake
[11:03:01] (03PS3) 10ArielGlenn: updated for support up through MW 1.29 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347625
[11:03:36] volans: Notice: /Stage[main]/Icinga/File[/var/lib/nagios/rw/nagios.cmd]/group: group changed 'icinga' to 'www-data' Notice: /Stage[main]/Icinga/File[/var/lib/nagios/rw/nagios.cmd]/mode: mode changed '0660' to '0664'
[11:03:48] hmm probably related ?
[11:03:55] yes, I noticed that too
[11:04:02] I think the initscript or icinga restarting changes that, and then puppet changes it back
[11:04:02] forgot to add to the task
[11:04:14] same goes for puppet_services
[11:04:17] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
[11:05:17] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset -0.000557 secs
[11:06:58] akosiaris: do you agree to have just the rsync on a different path with no restart, plus a make-icinga-primary script that, wrapped in run-no-puppet, fixes the permissions and restarts both icinga and nsca?
[11:08:41] volans: I agree it would make the race condition show up way way less often, but it would not solve the issue
[11:08:56] why not solve?
[11:09:40] ofc also a make-icinga-secondary to be run on the other one
[11:11:09] cause something changes the permissions on that file, causing nsca to block. restarting it alongside icinga would clear that ofc, but still .. there is some race over there
[11:16:55] (03CR) 10Marostegui: [C: 032] salt-misc: pin pylint <1.7.0 [software] - 10https://gerrit.wikimedia.org/r/349189 (owner: 10Hashar)
[11:18:11] (03Merged) 10jenkins-bot: salt-misc: pin pylint <1.7.0 [software] - 10https://gerrit.wikimedia.org/r/349189 (owner: 10Hashar)
[11:18:43] (03CR) 10Marostegui: "recheck" [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[11:18:55] (03CR) 10ArielGlenn: [C: 032] updated for support up through MW 1.29 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347625 (owner: 10ArielGlenn)
[11:19:16] the NTP problems were caused by https://gerrit.wikimedia.org/r/349196 : the change of einsteinium's IPv6 address caused puppet to rewrite /etc/ntp.conf, which made the ntp servers restart
[11:19:23] and after restart they lose synchronisation
[11:19:36] (03PS3) 10ArielGlenn: add a cheap sample script for importing to a local instance [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347626
[11:20:12] also, some of the ntpds failed to start due to the stupid startup race; fortunately all jessie systems use timesyncd, which just works fine
[11:20:41] (03PS2) 10Marostegui: s1.hosts: Add db2071 [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413)
[11:23:24] !log changing db2071 to replicate from db2016
[11:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:13] (03CR) 10Marostegui: [C: 032] s1.hosts: Add db2071 [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[11:25:33] (03PS2) 10Hashar: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349185
[11:25:38] (03Merged) 10jenkins-bot: s1.hosts: Add db2071 [software] - 10https://gerrit.wikimedia.org/r/349175 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[11:25:46] (03CR) 10DCausse: [C: 031] logstash - raise elasticsearch shard alert threshold to 34 [puppet] - 10https://gerrit.wikimedia.org/r/349168 (owner: 10Gehel)
[11:26:03] (03CR) 10ArielGlenn: [C: 032] add a cheap sample script for importing to a local instance [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/347626 (owner: 10ArielGlenn)
[11:26:36] (03CR) 10jerkins-bot: [V: 04-1] rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349185 (owner: 10Hashar)
[11:26:44] (03PS2) 10ArielGlenn: process zero-length text entries as regular sql inserts [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348138
[11:28:27] (03PS1) 10KartikMistry: Remove redundant setting from cxsave wgRateLimits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349201
[11:28:30] (03PS1) 10Marostegui: db-codfw.php: Pool db2071 with small weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349202 (https://phabricator.wikimedia.org/T163413)
[11:29:24] (03CR) 10Jcrespo: [C: 031] db-codfw.php: Pool db2071 with small weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349202 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[11:31:07] (03CR) 10Marostegui: [C: 032] db-codfw.php: Pool db2071 with small weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349202 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[11:32:33] <_joe_> !log removing hack for jobqueue's refreshlinks T163418 from the jobrunners
[11:32:41] (03Merged) 10jenkins-bot: db-codfw.php: Pool db2071 with small weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349202 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[11:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:42] T163418: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418
[11:32:52] (03CR) 10jenkins-bot: db-codfw.php: Pool db2071 with small weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349202 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui)
[11:36:44] (03CR) 10ArielGlenn: [C: 032] process zero-length text entries as regular sql inserts [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348138 (owner: 10ArielGlenn)
[11:43:10] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3197280 (10elukey)
[11:45:22] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 13Patch-For-Review, and 2 others: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3197284 (10Joe) The queue is down to 250K jobs, and I am confident all the old refreshlinks jobs have be...
[11:48:31] (03PS1) 10Alexandros Kosiaris: Fix sync-icinga-state cron presence/absence [puppet] - 10https://gerrit.wikimedia.org/r/349203
[12:04:49] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 13Patch-For-Review, and 2 others: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3197304 (10Joe) p:05Unbreak!>03High
[12:11:35] !log installing icu security updates
[12:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:09] PROBLEM - Check Varnish expiry mailbox lag on cp2024 is CRITICAL: CRITICAL: expiry mailbox lag is 804396
[12:12:36] !log restart Yarn Resource manager on analytics1001 (hadoop master) to pick up new JVM settings
[12:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:20] PROBLEM - Check Varnish expiry mailbox lag on cp3037 is CRITICAL: CRITICAL: expiry mailbox lag is 677477
[12:13:59] PROBLEM - Check Varnish expiry mailbox lag on cp2017 is CRITICAL: CRITICAL: expiry mailbox lag is 742553
[12:16:24] of those 3, cp3037 is throwing a couple of 503s, not many so far ^
[12:17:00] uhhhh
[12:17:03] dunno if this is the place to ask
[12:17:12] but is anyone else missing the edit summary bar?
[12:17:35] it's gone from safari on my mac, chrome on my mac and chrome on windows 10...
[12:20:16] huh [12:20:19] if i'm logged out it appears [12:35:39] (03PS1) 10Marostegui: db-codfw.php: Increase weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349207 (https://phabricator.wikimedia.org/T163413) [12:38:57] 06Operations, 06Labs: wikitech logging constant errors from /MemcachedPeclBagOStuff.php - https://phabricator.wikimedia.org/T163439#3197399 (10chasemp) [12:39:04] 06Operations, 06Labs: wikitech logging constant errors from /MemcachedPeclBagOStuff.php - https://phabricator.wikimedia.org/T163439#3197413 (10chasemp) p:05Triage>03Normal [12:39:16] (03CR) 10Marostegui: [C: 032] db-codfw.php: Increase weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349207 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui) [12:40:24] (03Merged) 10jenkins-bot: db-codfw.php: Increase weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349207 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui) [12:40:32] 06Operations, 06Labs: wikitech logging constant errors from /MemcachedPeclBagOStuff.php - https://phabricator.wikimedia.org/T163439#3197399 (10chasemp) @bd808 and @andrew I think possibly you guys were doing something related here recently? [12:40:33] (03CR) 10jenkins-bot: db-codfw.php: Increase weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349207 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui) [12:41:05] (03PS4) 10Ema: Revert "cache_upload: override CT updates on 304s" [puppet] - 10https://gerrit.wikimedia.org/r/348699 (https://phabricator.wikimedia.org/T162035) [12:41:13] (03CR) 10Ema: [V: 032 C: 032] Revert "cache_upload: override CT updates on 304s" [puppet] - 10https://gerrit.wikimedia.org/r/348699 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [12:42:49] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 719241 [12:43:44] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3197434 (10Marostegui) db2071 has now been serving traffic for around 1 hour: https://grafana.wikimedia.org/dashboard/file/server-board.json?... [12:45:28] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3197438 (10Marostegui) Maybe with this extra server, we can now depool one of the other "old" ones and let them run analyze over the weekend,... [12:47:42] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: Pool new server db2071 - https://phabricator.wikimedia.org/T163413#3197446 (10Marostegui) I have pooled db2071 with the same main traffic as the other api servers (50) but with weight 2 in API, instead of 1 as the other servers. We'll see how it goe... 
[12:49:34] (03PS1) 10Hashar: phpunit: automatically backup globals between tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349210 [12:54:22] (03PS3) 10Hashar: rpc: raise exception instead of die [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349185 [13:00:59] PROBLEM - Check Varnish expiry mailbox lag on cp2026 is CRITICAL: CRITICAL: expiry mailbox lag is 640675 [13:03:29] PROBLEM - Check Varnish expiry mailbox lag on cp2011 is CRITICAL: CRITICAL: expiry mailbox lag is 647989 [13:05:27] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "I think the whole idea here is to avoid spitting an error in the logs, but still sending an error message to the user." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349185 (owner: 10Hashar) [13:07:46] (03PS1) 10Marostegui: db-codfw.php: Increase API weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349211 (https://phabricator.wikimedia.org/T163413) [13:09:45] (03CR) 10Marostegui: [C: 032] db-codfw.php: Increase API weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349211 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui) [13:10:43] (03Merged) 10jenkins-bot: db-codfw.php: Increase API weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349211 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui) [13:10:51] (03CR) 10jenkins-bot: db-codfw.php: Increase API weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349211 (https://phabricator.wikimedia.org/T163413) (owner: 10Marostegui) [13:11:01] (03PS2) 10Ema: cache_upload: don't cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) [13:11:23] !log upgrading Piwik to 2.17.1 (brief downtime during the maintenance announced) [13:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:26] (03PS3) 10Ema: cache_upload: don't cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) [13:15:53] (03CR) 10BBlack: [C: 031] cache_upload: don't cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [13:18:52] !log restarting hhvm on mw2097/2098 to pick up icu security update [13:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:42] (03CR) 10Giuseppe Lavagetto: [C: 032] Refactor ReplicationController, version bump [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/342445 (owner: 10Giuseppe Lavagetto) [13:20:10] (03CR) 10Giuseppe Lavagetto: [C: 032] Add tests, improve code [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/342444 (owner: 10Giuseppe Lavagetto) [13:21:26] (03CR) 10BBlack: [C: 04-1] cache_upload: don't cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [13:32:59] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 657474 [13:34:11] 06Operations, 10ops-esams, 06DC-Ops: Broken IPMI/drac on cp3038 and cp3045 - https://phabricator.wikimedia.org/T157537#3197585 (10faidon) I just contacted EvoSwitch remote hands requesting to perform a power swap on both of those systems. @BBlack/@ema are Cc'ed. 
[13:36:09] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [13:36:18] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3197589 (10Eevans) >>! In T163280#3195765, @Cmjohnson wrote: > If I recall these have special ssds in them correct? Not this one, no; Model: Intel SSDSC2BX016T4R [13:39:29] (03PS4) 10Ema: cache_upload: do not cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) [13:40:09] PROBLEM - tcpircbot_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [13:41:09] <_joe_> ^^ volans akosiaris :P [13:41:09] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [13:41:15] elukey: ^did you stop ferm on analytics1003? [13:41:22] I was already looking _joe_ [13:42:06] it recovered on its own ? [13:42:32] readable, _, _ = select.select([bot.connection.socket] + files, [], []) [13:42:35] moritzm: yes sorry I am trying one thing [13:42:36] TypeError: argument must be an int, or have a fileno() method. [13:42:46] I guess it was restarted akosiaris [13:42:51] <_joe_> ahah [13:42:52] hmm [13:42:54] Apr 20 09:56:07 tegmen systemd[1]: tcpircbot-logmsgbot.service holdoff time over, scheduling restart. [13:43:09] some timeout ? [13:43:10] ooops [13:43:13] wrong time [13:43:18] elukey: ok, I was just wondering about the alert [13:43:23] yeah same error now [13:43:33] File "tcpircbot.py", line 150, in [13:43:34] moritzm: sorry didn't see it ;( [13:44:03] PROBLEM - MariaDB disk space on db1040 is CRITICAL: DISK CRITICAL - free space: /srv 97265 MB (5% inode=99%) [13:44:10] seems an error that can happen if the connection get lost [13:44:29] came back from downtime :( [13:44:34] volans: yeah, not surprised [13:44:44] but it is only 5% [13:44:53] the alter finished [13:45:03] server is catching up now [13:45:27] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3197613 (10Cmjohnson) 05Open>03stalled Stalling this until the new servers arrives [13:45:38] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and set up analyics1058-1069 - https://phabricator.wikimedia.org/T162216#3197617 (10Cmjohnson) p:05Triage>03Lowest [13:46:16] it is not using file-per-table config [13:46:23] yes :( [13:46:31] is any slave doing so? [13:46:38] yep [13:46:48] 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3197621 (10Cmjohnson) @fgiunchedi I would like to do this today if possible. HP is harassing me about returning the part [13:46:48] any slave you trust? [13:46:57] db1081 is using it [13:47:02] one of the big ones [13:47:11] we cannot put that as master [13:47:31] can we clone it to a server with more space? [13:47:48] clone the large ones? [13:48:30] we can reclone db1068 [13:48:31] from db1081 [13:48:35] was db1059 the planned replacement master? 
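The tcpircbot traceback above is select() choking on an entry that no longer has a usable file descriptor: irclib-style libraries set connection.socket to None when the IRC connection drops, and None has no fileno() method, which is exactly the TypeError shown. A minimal defensive sketch of that loop, reusing the bot.connection.socket plus files shape from the traceback (the names are illustrative, not the fix that was actually deployed):

    import select

    def wait_readable(bot, files, timeout=None):
        """select() only over descriptors that still have a valid fileno()."""
        valid = []
        for fd in [bot.connection.socket] + files:
            try:
                # A lost connection leaves None (or a closed object) here;
                # passing it straight to select.select() raises
                # "TypeError: argument must be an int, or have a fileno() method."
                if fd is not None and fd.fileno() >= 0:
                    valid.append(fd)
            except (ValueError, OSError):
                continue  # descriptor already closed: skip instead of crashing
        if not valid:
            return []  # nothing left to poll; caller should reconnect first
        readable, _, _ = select.select(valid, [], [], timeout)
        return readable

As the journal excerpt shows, systemd's holdoff restart papers over the crash; filtering dead descriptors would keep the bot from dying in the first place.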
[13:48:40] no, db1068 [13:48:44] ok [13:48:44] https://phabricator.wikimedia.org/T162133 [13:48:56] then let's clone db1081 into 68 [13:49:01] and promote it as master [13:49:03] sounds good [13:49:09] I trust the new ones [13:49:16] because I clone them from the current master [13:49:20] *cloned [13:49:27] they didn't fail last time [13:49:29] great, and I reconverted db1081 to file per table [13:49:35] that is great [13:49:36] work [13:49:47] can you do that? [13:49:52] yes, going to depool them now [13:49:59] I can try to work with db1040 meanwhile [13:50:16] we can compress some tables maybe [13:50:20] ah, it is not using file per table [13:50:22] nevermind [13:50:25] I will see [13:50:47] if it is not the master anymore, I can delete binlogs [13:51:19] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1003 is OK: OK ferm input default policy is set [13:51:20] well, it is now [13:51:32] I mean that we do not need them [13:51:44] old ones, for recovery purposes [13:51:51] ah, sure [13:52:07] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 and db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349218 (https://phabricator.wikimedia.org/T163110) [13:53:26] !log rolling restart of kartotherian / tilerator on maps-test cluster [13:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:41] that should hold for some time [13:54:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081 and db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349218 (https://phabricator.wikimedia.org/T163110) (owner: 10Marostegui) [13:55:45] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 and db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349218 (https://phabricator.wikimedia.org/T163110) (owner: 10Marostegui) [13:55:51] I am going to do the following things meanwhile: [13:55:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 and db1068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349218 (https://phabricator.wikimedia.org/T163110) (owner: 10Marostegui) [13:56:16] stop replication db1040 -> db2019 [13:56:26] and prepare the puppet patches [13:56:31] cool [13:56:32] (03CR) 10Ema: [C: 031] "Lack of fullstops, trailing whitespace. OCD aside, LGTM and to pcc https://puppet-compiler.wmflabs.org/6184/." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/349193 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff) [13:56:34] thanks [13:57:58] !log running reset slave all on db2019 [13:58:03] !log Stop MySQL on db1068 and db1081 for maintenance - T163110 [13:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:12] T163110: Reclone db1068 to become a slave in s4 - https://phabricator.wikimedia.org/T163110 [13:58:19] !log rolling restart of kartotherian / tilerator on maps eqiad cluster [13:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:26] marostegui, should we reimage? [13:58:32] is it jessie or trusty? [13:58:48] jynus: I'd rather not, because that means getting mariadb 10.0.30 no?
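The "reset slave all" logged above does more than stop replication: it also discards the replica's connection metadata (master host, credentials, log coordinates), so db2019 will not try to reconnect to the old master when restarted. A sketch of the equivalent statements run from Python, assuming the pymysql driver and illustrative credentials:

    import pymysql

    # Illustrative connection details; production tooling uses admin credentials.
    conn = pymysql.connect(host="db2019.codfw.wmnet", user="root",
                           password="...", autocommit=True)
    with conn.cursor() as cur:
        cur.execute("STOP SLAVE")       # halt the IO and SQL threads first
        cur.execute("RESET SLAVE ALL")  # drop master host, credentials, coordinates
    conn.close()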
[13:58:51] it is jessie [13:58:55] then cool [13:59:09] let's keep it if we can on an older version [13:59:16] they are both 10.0.23 :) [13:59:18] so that is good [13:59:19] good [14:00:10] so no longer replicating from eqiad to codfw on s4 [14:00:27] I will put that back on when 68 is back [14:01:13] ok, I guess it will take 2 hours or so to copy + catchup [14:01:21] np [14:01:34] (03PS2) 10Muehlenhoff: Load nf_conntrack via /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/349193 (https://phabricator.wikimedia.org/T136094) [14:01:47] hopefully less, it is not SSD, but it is with dc [14:01:53] *within [14:02:06] ah, catchup, yes [14:02:19] PROBLEM - MariaDB Slave IO: s4 on db1081 is CRITICAL: CRITICAL slave_io_state could not connect [14:02:20] PROBLEM - MariaDB Slave SQL: s4 on db1081 is CRITICAL: CRITICAL slave_sql_state could not connect [14:02:26] i silenced it :| [14:02:46] it is ok, replication doesn't ping on the passive dc [14:05:01] marostegui, important- we need 68 to boot back in statement [14:05:05] ok, copying data [14:05:15] true [14:05:34] so I will puppet merge ASAP [14:05:39] great [14:05:49] i will double check before bringing it up [14:06:08] although it will mess up the replication check :-/ [14:06:40] !log rolling restart of kartotherian / tilerator on maps codfw cluster [14:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:46] we better downtime replication check on s4 eqiad everywhere [14:07:01] (only the lag) [14:07:02] i will do that [14:07:48] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3197679 (10Ottomata) @Halfak, @leila. We've got some quotes back from vendors, and have a choice between the [[ http://www.amd.com/en-us/products/graphics/workstatio... [14:08:57] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3197682 (10Gehel) a bad blocks check (as suggested by @Papaul does not find anything wrong with sda: ``` gehel@elastic2020:~$ sudo badblocks -... [14:09:38] !log rebooting osmium for kernel update to Linux 4.9 [14:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:16] jynus: done [14:13:48] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3197684 (10Marostegui) Seriously, this looks _really_ similar to T149553 (and it is the same vendor even), is there anyway to justify to HP to... [14:17:05] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3197704 (10Gehel) @Marostegui yes, this sound like a good idea, but this is for @Papaul / @RobH to answer. I am way out of my depth here... [14:18:16] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3197707 (10Jgreen) @Dzahn thanks for the many clarifications! I think I understand. So as of today if "sms" does not show up in contact_groups for a host or service, individual Ops don't...
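The "boot back in statement" requirement above means db1068 has to come up with statement-based binary logging before it can act as the s4 master, the format the rest of this replication tree expects. A minimal pre-flight check, again assuming pymysql and illustrative credentials:

    import pymysql

    def assert_statement_binlog(host):
        """Fail loudly if the server is not using STATEMENT binlog format."""
        conn = pymysql.connect(host=host, user="root", password="...")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT @@GLOBAL.binlog_format")
                (fmt,) = cur.fetchone()
            if fmt != "STATEMENT":
                raise RuntimeError("%s has binlog_format=%s, refusing to promote"
                                   % (host, fmt))
        finally:
            conn.close()

    assert_statement_binlog("db1068.eqiad.wmnet")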
[14:23:10] !log rebooting radium (tor relay) for kernel update to Linux 4.9 [14:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:01] moritzm: check if there's a Tor update as well [14:24:13] (reprepro update, we fetch the latest from Tor directly) [14:26:25] 06Operations, 10Monitoring, 13Patch-For-Review: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3197733 (10akosiaris) I 've finally figured out the sheer beauty of this bug. It's a race condition between 3 components (puppet+icinga+nsca) What happens i... [14:29:11] k, looking [14:31:08] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3197734 (10GWicke) [14:32:41] !log upgrading tor on radium to 0.2.9.10 [14:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:19] RECOVERY - Check Varnish expiry mailbox lag on cp3037 is OK: OK: expiry mailbox lag is 21568 [14:33:42] (03PS5) 10Gehel: logstash - raise elasticsearch shard alert threshold to 34 [puppet] - 10https://gerrit.wikimedia.org/r/349168 [14:33:49] 06Operations, 06DC-Ops, 10netops: Interface errors on pfw-codfw:xe-15/0/0 - https://phabricator.wikimedia.org/T163447#3197764 (10ayounsi) [14:34:29] (03CR) 10BBlack: [C: 031] cache_upload: do not cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [14:35:54] 06Operations, 06DC-Ops, 10Traffic, 10netops, 13Patch-For-Review: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3197781 (10ayounsi) a:03ayounsi [14:36:56] (03PS1) 10Jcrespo: mariadb: promote db1064 as the new s4 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/349220 (https://phabricator.wikimedia.org/T163110) [14:40:41] (03PS1) 10Alexandros Kosiaris: icinga: Fix permissions for /var/lib/nagios/rw [puppet] - 10https://gerrit.wikimedia.org/r/349221 (https://phabricator.wikimedia.org/T163286) [14:41:49] (03CR) 10Marostegui: [C: 04-1] "Minor thing: the commit message says db1064, but it is db1068. The rest looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/349220 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [14:42:27] 06Operations, 06Labs: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402#3197808 (10Paladox) @chasemp im not sure if you already thought of this but what about switching labs services to the second labservice if there is a secondary one? I am unsure if nodepool will be... [14:42:50] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 4 [14:48:11] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3197814 (10Papaul) @Marostegui on my side i will have to have something to show HP that the CPU is bad since i have nothing pointing that the... [14:49:09] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3197815 (10Cmjohnson) @Dzahn I replaced the disk in slot 0 which is /dev/sda. I changed bios order to boot from /dev/sdb but it does not appear grub is installed. If I leave it be it defaults to a fresh install.... 
[14:51:38] 06Operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 13Patch-For-Review, and 2 others: jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3196736 (10GWicke) Did you see repeat executions in this case, beyond the initial root to leaf job expan... [14:52:58] (03PS2) 10Jcrespo: mariadb: promote db1068 as the new s4 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/349220 (https://phabricator.wikimedia.org/T163110) [14:53:18] (03CR) 10Jcrespo: "Thank you, didn't see that." [puppet] - 10https://gerrit.wikimedia.org/r/349220 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [14:55:22] 06Operations, 06DC-Ops, 10netops: Interface errors on pfw-codfw:xe-15/0/0 - https://phabricator.wikimedia.org/T163447#3197830 (10Jgreen) If we're just talking about a <1m traffic hiccup then it's fine to do anytime. [14:55:29] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3197831 (10leila) @Ottomata Bob and I reviewed and we are happy with your choice. [14:55:42] 06Operations, 10ops-eqiad: decommission the old pay-lvs1001/pay-lvs1002 boxes - https://phabricator.wikimedia.org/T156284#3197832 (10Cmjohnson) 05Open>03Resolved Removed from racks, disks removes and destroyed, updated racktables. [14:57:04] should I merge now https://gerrit.wikimedia.org/r/349220 ? [14:57:10] marostegui^ [14:57:39] I would wait a bit until the server is at least catching up [14:57:58] but we need it on STATEMENT before it boots [14:58:17] mysql, I mean [14:58:38] I thought about changing it manually [14:58:41] but yeah [14:58:43] just go ahead [14:59:01] if you do it manually, we can wait [14:59:10] Yeah I thought about doing it manually [14:59:35] I will prepare other patches [14:59:38] 06Operations, 06DC-Ops, 10Traffic, 10netops, 13Patch-For-Review: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3197843 (10ayounsi) [15:00:18] (03PS1) 10Gehel: elasticsearch - stop using experimental apt repository [puppet] - 10https://gerrit.wikimedia.org/r/349222 [15:00:51] (03CR) 10Muehlenhoff: [C: 031] elasticsearch - stop using experimental apt repository [puppet] - 10https://gerrit.wikimedia.org/r/349222 (owner: 10Gehel) [15:01:05] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3197846 (10Ottomata) Great, thanks! [15:01:42] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3197848 (10Jgreen) I removed 'sms' from notification for frack hosts, and changed myself to 24x7. [15:02:41] 06Operations, 06DC-Ops, 10netops: Interface errors on pfw-codfw:xe-15/0/0 - https://phabricator.wikimedia.org/T163447#3197878 (10ayounsi) To clarify, that's on Node1, which might be named pfw2-codfw, ( AJ5112AA0049 ) FPC0, PIC0. 
[15:03:08] (03CR) 10Gehel: [C: 032] elasticsearch - stop using experimental apt repository [puppet] - 10https://gerrit.wikimedia.org/r/349222 (owner: 10Gehel) [15:05:45] (03CR) 10Gehel: [C: 032] logstash - raise elasticsearch shard alert threshold to 34 [puppet] - 10https://gerrit.wikimedia.org/r/349168 (owner: 10Gehel) [15:05:52] (03PS6) 10Gehel: logstash - raise elasticsearch shard alert threshold to 34 [puppet] - 10https://gerrit.wikimedia.org/r/349168 [15:06:12] (03PS1) 10Alexandros Kosiaris: Make varnishkafka log producer descriptions unique [puppet] - 10https://gerrit.wikimedia.org/r/349224 (https://phabricator.wikimedia.org/T163286) [15:06:20] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/349221 (https://phabricator.wikimedia.org/T163286) (owner: 10Alexandros Kosiaris) [15:07:26] (03PS2) 10Alexandros Kosiaris: icinga: Fix permissions for /var/lib/nagios/rw [puppet] - 10https://gerrit.wikimedia.org/r/349221 (https://phabricator.wikimedia.org/T163286) [15:07:31] (03PS3) 10Alexandros Kosiaris: icinga: Fix permissions for /var/lib/nagios/rw [puppet] - 10https://gerrit.wikimedia.org/r/349221 (https://phabricator.wikimedia.org/T163286) [15:07:37] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] icinga: Fix permissions for /var/lib/nagios/rw [puppet] - 10https://gerrit.wikimedia.org/r/349221 (https://phabricator.wikimedia.org/T163286) (owner: 10Alexandros Kosiaris) [15:08:30] (03PS1) 10Jgreen: remove jgreen from icinga group 'sms' [puppet] - 10https://gerrit.wikimedia.org/r/349225 [15:11:32] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3197891 (10Marostegui) @Papaul right! Just for the record, the way we were able to justify the error was by seeing it on the ILO after one of... [15:11:51] (03PS2) 10Jgreen: remove jgreen from icinga group 'sms' [puppet] - 10https://gerrit.wikimedia.org/r/349225 [15:12:57] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: (null) [15:13:41] elasticsearch warning is me, seems there is an issue with my new check... [15:13:52] (03CR) 10Ema: [V: 032 C: 032] cache_upload: do not cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [15:13:58] (03CR) 10Jgreen: [C: 032] remove jgreen from icinga group 'sms' [puppet] - 10https://gerrit.wikimedia.org/r/349225 (owner: 10Jgreen) [15:14:00] (03PS5) 10Ema: cache_upload: do not cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) [15:14:07] (03CR) 10Ema: [V: 032 C: 032] cache_upload: do not cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [15:14:21] akosiaris: it looks like https://gerrit.wikimedia.org/r/#/c/349221 introduced an invalid relationship... [15:14:56] sigh [15:15:05] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3197901 (10Jgreen) >>! In T163368#3197848, @Jgreen wrote: > I removed 'sms' from notification for frack hosts, and changed myself to 24x7.
...and removed myself from 'sms' [15:15:23] (03PS6) 10Ema: cache_upload: do not cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) [15:15:31] (03CR) 10Ema: [V: 032 C: 032] cache_upload: do not cache tiny objects at the backend layer [puppet] - 10https://gerrit.wikimedia.org/r/349188 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [15:16:06] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [15:16:16] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [15:16:19] !log disabling pybal on lvs2002 for T163323 [15:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:27] T163323: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323 [15:17:55] !log deleting duplicate rows in ores_classification dated after revision 775502802 (dated April 15th) (T163337) [15:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:03] T163337: Watchlist entries duplicated several times - https://phabricator.wikimedia.org/T163337 [15:18:04] (03PS2) 10Chad: Jenkins: install jdk, not just jre [puppet] - 10https://gerrit.wikimedia.org/r/348961 [15:18:06] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:20:16] (03PS2) 10Alexandros Kosiaris: Make varnishkafka log producer descriptions unique [puppet] - 10https://gerrit.wikimedia.org/r/349224 (https://phabricator.wikimedia.org/T163286) [15:20:18] (03PS1) 10Alexandros Kosiaris: icinga::event_handlers::raid: Remove dependency on nagios.cmd [puppet] - 10https://gerrit.wikimedia.org/r/349227 [15:21:28] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] icinga::event_handlers::raid: Remove dependency on nagios.cmd [puppet] - 10https://gerrit.wikimedia.org/r/349227 (owner: 10Alexandros Kosiaris) [15:21:36] (03PS2) 10Alexandros Kosiaris: icinga::event_handlers::raid: Remove dependency on nagios.cmd [puppet] - 10https://gerrit.wikimedia.org/r/349227 [15:21:39] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] icinga::event_handlers::raid: Remove dependency on nagios.cmd [puppet] - 10https://gerrit.wikimedia.org/r/349227 (owner: 10Alexandros Kosiaris) [15:21:58] damn icinga is a rabbithole [15:24:02] akosiaris: you can blame me for this last one [15:24:26] :P [15:24:28] (03CR) 10BBlack: [C: 031] Make varnishkafka log producer descriptions unique [puppet] - 10https://gerrit.wikimedia.org/r/349224 (https://phabricator.wikimedia.org/T163286) (owner: 10Alexandros Kosiaris) [15:24:34] reason was the path is hardcoded into the python script [15:24:37] and wanted to be sure was there [15:24:50] there you have it.. you can't [15:24:55] :D [15:24:57] it's impossible to be sure :P [15:25:01] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3197946 (10Jgreen) Another question...does it make sense to move IRC notifications out of #wikimedia-operations and into #wikimedia-fundraising? I'm not sure of the mechanics of doing tha... 
[15:25:02] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:25:27] akosiaris: between icinga and puppet we are guaranteed to consume all the mental energy of any size staff :-P [15:25:29] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3197947 (10ema) 05Open>03Resolved a:03ema The user-facing issue h... [15:26:12] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: (null) [15:26:32] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: (null) [15:26:48] :-) [15:26:52] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: (null) [15:26:52] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: (null) [15:26:56] gehel: shout if you need help [15:27:07] almost there! [15:28:25] ok, np [15:28:28] (03PS1) 10Gehel: elasticsearch - icinga check has special characters that need escaping [puppet] - 10https://gerrit.wikimedia.org/r/349229 [15:29:04] volans: I'm not entirely sure how icinga escaping works... but my guess is that https://gerrit.wikimedia.org/r/#/c/349229/ should fix my issue. Could you have a look? [15:29:25] gehel: ""$ARG1$" ??? [15:29:39] (03PS3) 10Alexandros Kosiaris: Make varnishkafka log producer descriptions unique [puppet] - 10https://gerrit.wikimedia.org/r/349224 (https://phabricator.wikimedia.org/T163286) [15:29:42] (03PS2) 10Gehel: elasticsearch - icinga check has special characters that need escaping [puppet] - 10https://gerrit.wikimedia.org/r/349229 [15:29:45] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Make varnishkafka log producer descriptions unique [puppet] - 10https://gerrit.wikimedia.org/r/349224 (https://phabricator.wikimedia.org/T163286) (owner: 10Alexandros Kosiaris) [15:29:50] my editor just decided to be too smart for me :( [15:29:55] ehehhe [15:30:41] anyway, yes [15:30:55] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/349229 (owner: 10Gehel) [15:30:55] that $ARG1$ is in fact ">=0.34", and my guess is that this ">" needs escaping in some way... [15:31:11] (03PS3) 10Gehel: elasticsearch - icinga check has special characters that need escaping [puppet] - 10https://gerrit.wikimedia.org/r/349229 [15:33:09] 06Operations, 10ops-eqiad: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3197974 (10Jgreen) @cmjohnson this is for the replacement box for indium, I merged in T163361 but I'm not sure whether phabricator would have notified you. [15:33:20] (03CR) 10Gehel: [C: 032] elasticsearch - icinga check has special characters that need escaping [puppet] - 10https://gerrit.wikimedia.org/r/349229 (owner: 10Gehel) [15:33:27] PROBLEM - puppet last run on lvs2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:17] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: (null) [15:34:47] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:47] PROBLEM - puppet last run on mw2108 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:37] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: move frdb1002 from pfw1 to pfw2 - https://phabricator.wikimedia.org/T163268#3197979 (10Jgreen) a:05Jgreen>03Cmjohnson @Cmjohnson assigning this to you. 
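gehel's guess is right in spirit: the icinga check command is eventually handed to a shell, and an unquoted $ARG1$ of ">=0.34" turns the ">" into an output redirection instead of an argument. A small Python illustration of the failure mode and of what wrapping $ARG1$ in quotes achieves (the plugin path and flag name are illustrative):

    import shlex

    threshold = ">=0.34"  # what icinga substitutes for $ARG1$
    plugin = "/usr/lib/nagios/plugins/check_elasticsearch_shards"

    # Unquoted, the shell parses '>' as a redirection: the plugin sees no
    # threshold at all, and stdout goes to a file literally named '=0.34'.
    unsafe = "%s --threshold %s" % (plugin, threshold)

    # Quoted, the whole string survives as one argument, which is what the
    # patch's double quotes around $ARG1$ accomplish in the command file.
    safe = "%s --threshold %s" % (plugin, shlex.quote(threshold))
    print(safe)  # ... --threshold '>=0.34'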
The destination port on pfw1 should be ready to go, so just give me a little warning before you do the... [15:36:38] PROBLEM - MariaDB Slave Lag: s4 on db1068 is CRITICAL: CRITICAL slave_sql_lag could not connect [15:36:39] PROBLEM - DPKG on cp1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:36:41] PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:36:43] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:36:44] PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[elasticsearch] [15:36:49] PROBLEM - mysqld processes on db1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:36:49] PROBLEM - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:36:59] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 73, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_sha [15:37:00] PROBLEM - MariaDB Slave Lag: s4 on db1081 is CRITICAL: CRITICAL slave_sql_lag could not connect [15:37:05] PROBLEM - mysqld processes on db1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:37:18] you can ignore 68 and 81 [15:37:18] how is that possible? I downtimed them till monday [15:37:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:37:34] marostegui: that might actually be my fault [15:37:40] ah [15:37:54] lvs2002 "DOWN" is just icinga, it's actually both still up and still not servicing live traffic (2005 has taken over) [15:38:09] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:38:29] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 73, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_sha [15:38:40] akosiaris: downtimed them again [15:38:47] gehel: if you want to force a puppet run quickly on all of them [15:38:52] sudo cumin 'R:class = elasticsearch::nagios::check' 'run-puppet-agent' [15:38:59] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:39:00] what's with ocg1001 host down around the same time as elastic2020 and the db alerts, etc?
[15:39:09] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 73, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_sha [15:39:09] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:39:09] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [15:39:13] that was a lot of seemingly-unrelated things all at once [15:39:21] volans: Oh, I should have done that! I was still using salt... [15:39:24] bblack: icinga service restart [15:39:36] oh ok! [15:39:37] so all scheduled downtimes were lost? [15:39:42] akosiaris: no worries, at least it is under control :) [15:39:45] not all [15:39:49] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 73, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_sha [15:39:49] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.97, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f01d8cf4950: Failed to establish a new connection: [Errno 111] Connection refused,)) [15:39:56] just the last few hours [15:40:00] heh makes sense, cp1008 alerted too, it's long-term perma-downtime [15:40:03] (usually) [15:40:10] PROBLEM - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 [15:40:11] ACKNOWLEDGEMENT - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T163454 [15:40:16] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163454#3198011 (10ops-monitoring-bot) [15:40:19] PROBLEM - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused [15:40:40] PROBLEM - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused [15:40:59] PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:41:09] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:41:29] PROBLEM - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused [15:41:39] PROBLEM - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:41:49] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [15:42:09] PROBLEM - cassandra-c CQL 
10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused [15:42:19] PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:42:52] 06Operations, 06Labs: wikitech logging constant errors from /MemcachedPeclBagOStuff.php - https://phabricator.wikimedia.org/T163439#3198029 (10bd808) [15:43:25] ah this is expired downtime for sure [15:43:27] ACKNOWLEDGEMENT - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris T163292 [15:43:27] ACKNOWLEDGEMENT - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 alexandros kosiaris T163292 [15:43:27] ACKNOWLEDGEMENT - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused alexandros kosiaris T163292 [15:43:27] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused alexandros kosiaris T163292 [15:43:28] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused alexandros kosiaris T163292 [15:43:28] ACKNOWLEDGEMENT - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed alexandros kosiaris T163292 [15:43:28] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused alexandros kosiaris T163292 [15:43:29] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused alexandros kosiaris T163292 [15:43:29] ACKNOWLEDGEMENT - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed alexandros kosiaris T163292 [15:43:30] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused alexandros kosiaris T163292 [15:43:30] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused alexandros kosiaris T163292 [15:43:31] ACKNOWLEDGEMENT - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed alexandros kosiaris T163292 [15:43:31] ACKNOWLEDGEMENT - restbase endpoints health on restbase1018 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.97, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1d6a221950: Failed to establish a new connection: [Errno 111] Connection refused,)) alexandros kosiaris T163292 [15:43:37] supe [15:43:39] *super [15:43:51] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3198035 (10chasemp) first pass https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource%3ATools%2FAdmin&type=revision&diff=1757014&oldid=1756658 [15:44:19] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:44:49] ocg1001 seems legit though [15:44:58] has a disk broken [15:45:15] 
https://phabricator.wikimedia.org/T161158 [15:45:40] ACKNOWLEDGEMENT - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris https://phabricator.wikimedia.org/T161158 [15:46:28] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3123279 (10Volans) Relating it also to T155692 [15:48:59] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:49:55] (03PS1) 10Rush: wmcs: change shared shinken 'puppet run' to 'puppet errors' [puppet] - 10https://gerrit.wikimedia.org/r/349233 [15:51:48] PROBLEM - puppet last run on lvs2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:55:15] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3198091 (10Papaul) [15:55:27] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3198093 (10Halfak) +1 [15:56:44] (03CR) 10Madhuvishy: [C: 031] wmcs: change shared shinken 'puppet run' to 'puppet errors' [puppet] - 10https://gerrit.wikimedia.org/r/349233 (owner: 10Rush) [15:57:53] RECOVERY - mysqld processes on db1068 is OK: PROCS OK: 1 process with command name mysqld [15:58:08] RECOVERY - MariaDB Slave Lag: s4 on db1068 is OK: OK slave_sql_lag not a slave [16:01:47] jynus: db1068 catching up [16:01:53] puppet is disabled on the host [16:01:57] feel free to merge your patch [16:02:04] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3198126 (10akosiaris) [16:02:07] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: switchover icinga.wikimedia.org from einsteinium to tegmen - https://phabricator.wikimedia.org/T163324#3198124 (10akosiaris) 05Open>03Resolved After a long day and a rabbithole, along with some unwanted pages, this is done. 
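The "db1068 catching up" above is just waiting for Seconds_Behind_Master to drain to zero before the master swap. A minimal polling sketch, assuming the pymysql driver and an illustrative replication-admin account (real tooling would add proper credentials and error handling):

    import time
    import pymysql

    def wait_for_catchup(host, poll_seconds=10):
        """Block until the replica reports zero lag, printing progress."""
        conn = pymysql.connect(host=host, user="repl_admin", password="...",
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            while True:
                with conn.cursor() as cur:
                    cur.execute("SHOW SLAVE STATUS")
                    status = cur.fetchone()
                # None means the replication threads are not running at all.
                lag = status.get("Seconds_Behind_Master") if status else None
                if lag == 0:
                    return
                print("%s lag: %s" % (host, lag))
                time.sleep(poll_seconds)
        finally:
            conn.close()

    wait_for_catchup("db1068.eqiad.wmnet")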
[16:02:08] RECOVERY - puppet last run on mw2108 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:02:24] RECOVERY - mysqld processes on db1081 is OK: PROCS OK: 1 process with command name mysqld [16:02:46] marostegui, I double checked the log is in statement [16:02:58] good :) [16:03:07] not that I do not trust you [16:03:13] no no, I am happy you did [16:03:13] I do not trust mysql config [16:03:18] RECOVERY - MariaDB Slave IO: s4 on db1081 is OK: OK slave_io_state Slave_IO_Running: Yes [16:03:18] RECOVERY - MariaDB Slave SQL: s4 on db1081 is OK: OK slave_sql_state Slave_SQL_Running: Yes [16:03:25] (03PS2) 10Alexandros Kosiaris: tcpircbot: Subscribe to the correct File resource [puppet] - 10https://gerrit.wikimedia.org/r/349186 [16:03:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] tcpircbot: Subscribe to the correct File resource [puppet] - 10https://gerrit.wikimedia.org/r/349186 (owner: 10Alexandros Kosiaris) [16:04:17] (03CR) 10Jcrespo: [C: 032] mariadb: promote db1068 as the new s4 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/349220 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [16:04:22] (03PS3) 10Jcrespo: mariadb: promote db1068 as the new s4 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/349220 (https://phabricator.wikimedia.org/T163110) [16:04:41] (03CR) 10Jcrespo: [C: 032] " feel free to merge your patch" [puppet] - 10https://gerrit.wikimedia.org/r/349220 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [16:06:14] (03PS1) 10Jgreen: use fr-tech-ops/fr-tech contacts to catch and deliver icinga warnings [puppet] - 10https://gerrit.wikimedia.org/r/349236 [16:07:18] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:07:26] marostegui, wait, where does db1068 come from? [16:07:29] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3198158 (10Papaul) @Marostegui Thanks for for update. [16:07:31] db1081 [16:07:35] (03PS2) 10Jgreen: use fr-tech-ops/fr-tech contacts to catch and deliver icinga warnings [puppet] - 10https://gerrit.wikimedia.org/r/349236 [16:07:41] no, which shard was before? [16:07:46] s4 [16:08:02] it wasn't on grafana [16:08:09] (prometheus) [16:08:11] that is strange :| [16:08:12] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3198161 (10Papaul) [16:08:24] !log uploaded piwik 2.17.1-1 to jessie-wikimedia main [16:08:27] it has not been moved from the shard [16:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:15] (03PS1) 10Jcrespo: prometheus-myqsld-exporter: Promote db1068 to the s4 master [puppet] - 10https://gerrit.wikimedia.org/r/349238 (https://phabricator.wikimedia.org/T163110) [16:09:42] 06Operations, 10Analytics: sync bohrium and apt.wikimedia.org piwik versions - https://phabricator.wikimedia.org/T149993#3198165 (10elukey) 05Open>03Resolved a:03elukey Just upgraded Piwik on bohrium and uploaded the new deb (retrieved from https://debian.piwik.org) to jessie-wikimedia main (as it was do... [16:10:13] marostegui, I am not blind, right? 
https://gerrit.wikimedia.org/r/#/c/349238/1/modules/role/files/prometheus/mysql-core_eqiad.yaml [16:10:19] RECOVERY - PyBal backends health check on lvs2002 is OK: PYBAL OK - All pools are healthy [16:10:38] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:10:48] RECOVERY - pybal on lvs2002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [16:10:49] (03CR) 10Jgreen: [C: 032] use fr-tech-ops/fr-tech contacts to catch and deliver icinga warnings [puppet] - 10https://gerrit.wikimedia.org/r/349236 (owner: 10Jgreen) [16:11:00] jynus: no, it is not there or anywhere else indeed [16:11:31] that is a good sign, it means it never had performance issues and we never had to look at its graphs! [16:11:42] 06Operations, 06DC-Ops, 10Traffic, 10netops, 13Patch-For-Review: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3198174 (10ayounsi) 05Open>03Resolved papaul replaced the SFP on the switch side. Stress-testing done with bblack, no more interfaces errors. [16:12:08] RECOVERY - MariaDB Slave Lag: s4 on db1081 is OK: OK slave_sql_lag Replication lag: 0.08 seconds [16:12:37] ^ the ssds… <3 [16:13:03] (03PS1) 10Jcrespo: s4.hosts: set db1068 is the new s4 master [software] - 10https://gerrit.wikimedia.org/r/349240 (https://phabricator.wikimedia.org/T163110) [16:13:23] (03CR) 10Marostegui: [C: 031] s4.hosts: set db1068 is the new s4 master [software] - 10https://gerrit.wikimedia.org/r/349240 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [16:14:12] (03CR) 10Jcrespo: [V: 032 C: 032] s4.hosts: set db1068 is the new s4 master [software] - 10https://gerrit.wikimedia.org/r/349240 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [16:14:14] (03PS1) 10Gehel: wdqs - monitor response times for both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/349241 [16:15:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349246 [16:15:39] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349246 [16:17:26] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349246 (owner: 10Marostegui) [16:17:32] (03CR) 10Jcrespo: [C: 032] prometheus-myqsld-exporter: Promote db1068 to the s4 master [puppet] - 10https://gerrit.wikimedia.org/r/349238 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [16:17:37] (03PS2) 10Jcrespo: prometheus-myqsld-exporter: Promote db1068 to the s4 master [puppet] - 10https://gerrit.wikimedia.org/r/349238 (https://phabricator.wikimedia.org/T163110) [16:17:41] (03PS1) 10Alexandros Kosiaris: tcpircbot: Also remove requirement of ${title}.json [puppet] - 10https://gerrit.wikimedia.org/r/349247 [16:18:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] tcpircbot: Also remove requirement of ${title}.json [puppet] - 10https://gerrit.wikimedia.org/r/349247 (owner: 10Alexandros Kosiaris) [16:18:29] !log depool varnish-be on cp2017 [16:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:35] (03PS1) 10Jgreen: put cmjohnson back in fr-tech-ops contact list [puppet] - 10https://gerrit.wikimedia.org/r/349248 [16:19:24] ok, so the plan is to migrate all s4 eqiad slaves below db1068 [16:19:30] sounds good [16:19:31] elukey: can i take mw2256 down? 
[16:19:39] jynus: going to get the mediawiki patch ready in a sec [16:19:43] stop replication because we can now [16:20:01] promote 68 [16:20:04] deploy the patch [16:20:07] (03Abandoned) 10Jgreen: put cmjohnson back in fr-tech-ops contact list [puppet] - 10https://gerrit.wikimedia.org/r/349248 (owner: 10Jgreen) [16:20:17] not sure if in that order [16:20:19] jynus: let's also move db1040 under db1068 [16:20:26] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349246 (owner: 10Marostegui) [16:20:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349246 (owner: 10Marostegui) [16:20:41] before the patch? [16:20:56] no, after [16:21:00] papaul: o/ - let me check and shut it down in case [16:21:04] yes, that is for sure [16:21:13] although dbstore1001 will have its problems [16:21:14] and restore replication with codfw? [16:21:19] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:21:23] yes [16:21:31] I mean before or after the patch? [16:21:47] after the patch better [16:21:51] or not? [16:22:13] making it easier makes it harder to select the best option :-) [16:22:22] hahaha [16:22:27] let's do it before the patch [16:22:35] just let's stop replication [16:22:43] and kill heartbeat everywhere [16:22:46] so we do not have issues [16:22:57] stop puppet on 68 and 40 [16:23:02] and kill heartbeat [16:23:13] that will make sure we do not have issues [16:24:02] oh, but 68 is still catching up? [16:24:09] papaul: mw2256 is shutting down, just scheduled 2 hours of maintenance [16:24:51] (03PS1) 10Marostegui: db-eqiad.php: Promote db1068 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349249 (https://phabricator.wikimedia.org/T162133) [16:24:52] are we sure that such a huge difference between hds and ssds is normal? [16:25:12] jynus: Not the first time I see this difference [16:25:18] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3198283 (10Papaul) @Marostegui none of the systems were ready. [16:25:18] jynus: review that patch carefully please [16:26:07] (03CR) 10Marostegui: [C: 04-2] "do not deploy yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349249 (https://phabricator.wikimedia.org/T162133) (owner: 10Marostegui) [16:26:59] (03CR) 10Jcrespo: [C: 031] "Looks good when 68 catches up." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349249 (https://phabricator.wikimedia.org/T162133) (owner: 10Marostegui) [16:27:44] maybe we can convert db1040 to file per table and use it for cloning if needed? [16:27:52] or just throw it away :-/ [16:27:56] elukey: thanks [16:28:01] elukey: the [16:28:04] That is not a bad idea jynus! [16:28:10] (03PS2) 10Rush: wmcs: change shared shinken 'puppet run' to 'puppet errors' [puppet] - 10https://gerrit.wikimedia.org/r/349233 [16:28:12] probably too slow [16:28:16] reliable [16:28:24] but too slow [16:28:39] if the big servers were recloned from it [16:28:53] however, I do not trust 53, 56, 59 and 64 [16:29:10] I can keep converting them to file per table: https://phabricator.wikimedia.org/T161088 [16:29:14] we can keep it around for some time [16:29:25] for pt-checksum [16:29:33] yeah [16:29:58] pt-table-checksum finished on that shard already, so "only" pending your compare.py [16:30:04] oh [16:30:08] no differences?
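The heartbeat they plan to kill on 68 and 40 is pt-heartbeat, which writes the current timestamp into a table on the master once a second; replicas derive lag by comparing the last replicated beat against the wall clock, independently of SHOW SLAVE STATUS. Killing it during the topology change avoids out-of-band writes racing the switch. A sketch of the read side, assuming pymysql and a simplified heartbeat table with a DATETIME ts column in UTC (pt-heartbeat's real schema stores a higher-precision timestamp, same idea):

    import pymysql

    def heartbeat_lag_seconds(replica_host):
        """Lag as pt-heartbeat sees it: wall clock minus last replicated beat."""
        conn = pymysql.connect(host=replica_host, user="monitor", password="...")
        try:
            with conn.cursor() as cur:
                # The beat row is written on the master and arrives here
                # through replication, so the difference is replication delay.
                cur.execute("SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP())"
                            " FROM heartbeat.heartbeat")
                (lag,) = cur.fetchone()
            return lag
        finally:
            conn.close()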
[16:30:16] I cannot believe it [16:30:25] ah, no old_image checks [16:30:27] haha i never said that! [16:30:34] https://phabricator.wikimedia.org/T162593 [16:30:42] archive and oldimage are the ones with unsafe statements [16:30:57] can I enable puppet on db1068 right? [16:31:00] just what I said [16:31:07] yes [16:31:12] no [16:31:13] wait [16:31:19] that will launch pt-heartbeat [16:31:21] 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3198298 (10RobH) [16:31:22] ah [16:31:35] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3198300 (10RobH) [16:31:36] good one [16:31:38] let's have 0 writes until the actual failover [16:31:50] out-of-band writes, I mean [16:32:10] sure [16:32:50] so waiting for the sync, then we do the topology changes and the deploy [16:32:51] (03CR) 10Rush: [C: 032] wmcs: change shared shinken 'puppet run' to 'puppet errors' [puppet] - 10https://gerrit.wikimedia.org/r/349233 (owner: 10Rush) [16:32:52] 06Operations, 10fundraising-tech-ops: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3198304 (10Dzahn) > So as of today if "sms" does not show up in contact_groups for a host or service, individual Ops don't get email or sms notification. If that's correct, we're much clo... [16:33:01] I can take it from here [16:33:04] you may be tired [16:33:16] No, I am not leaving you here :) [16:33:24] (03PS3) 10Jcrespo: prometheus-myqsld-exporter: Promote db1068 to the s4 master [puppet] - 10https://gerrit.wikimedia.org/r/349238 (https://phabricator.wikimedia.org/T163110) [16:33:29] I will be here till the switchover is done [16:35:58] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3198322 (10Marostegui) Just for the record, I finished an alter table on the revision table on db1065, and it is not filesorting. ``` root@db... [16:37:23] (03PS2) 10Andrew Bogott: toollabs: iterate bigbrother job dict values not keys [puppet] - 10https://gerrit.wikimedia.org/r/348885 (https://phabricator.wikimedia.org/T163265) (owner: 10BryanDavis) [16:37:27] (03CR) 10Hashar: [C: 031] Jenkins: install jdk, not just jre [puppet] - 10https://gerrit.wikimedia.org/r/348961 (owner: 10Chad) [16:39:24] (03CR) 10Andrew Bogott: [C: 032] toollabs: iterate bigbrother job dict values not keys [puppet] - 10https://gerrit.wikimedia.org/r/348885 (https://phabricator.wikimedia.org/T163265) (owner: 10BryanDavis) [16:43:59] RECOVERY - Check Varnish expiry mailbox lag on cp2017 is OK: OK: expiry mailbox lag is 0 [16:46:18] !log repool varnish-be on cp2017 [16:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:29] jynus almost there! [16:48:38] I see [16:48:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:50:17] I have checked again the binlog to make sure it is statement [16:55:20] so we go? [16:55:35] yes!
[16:55:52] ok, so I am going to move the slaves first [16:55:59] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=8888): Max retries exceeded with url: /?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa571c1b850: Failed to establish a new connection: [Errno 111] Connection refused,)) [16:56:07] ok [16:56:16] at least the ones with GTID [16:56:26] good [16:56:33] mobrovac: ^^ : Generic error? [16:56:56] (03CR) 10Andrew Bogott: [C: 04-1] "So apparently until nginx 1.11.11, HUP just... doesn't work? It seems to leave zombie workers behind, forever." [puppet] - 10https://gerrit.wikimedia.org/r/348954 (owner: 10Andrew Bogott) [16:56:59] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:57:09] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:57:58] bearND: recovered ^, a temp hick-up [16:58:07] !log moving GTID s4 eqiad replicas under db1068 [16:58:07] yup [16:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:59] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [16:59:48] marostegui, do you see tendril? [16:59:52] yes checking it [17:00:06] shout if you see errors or something [17:00:09] I am seeing the changes [17:00:10] so far so good [17:00:33] aren't you loving gtid? :) [17:00:45] we had a script [17:00:47] :-) [17:01:12] that is why I am only interested on the transactional replication functionality [17:01:21] but gtid is more reliable... when it works [17:01:31] hehe [17:01:35] which script?! [17:01:45] the one to stop replicas in sync [17:01:49] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 671505 [17:01:50] it is on dbtools [17:01:53] aah that one [17:01:54] ok ok [17:02:01] I thought you meant to do all the move [17:02:11] we can do one of those too [17:02:36] (03PS2) 10Andrew Bogott: Fix configuration of size limits to allow paged LDAP search requests [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [17:02:56] do 69 have GTIDs? [17:02:59] PROBLEM - Check Varnish expiry mailbox lag on cp2022 is CRITICAL: CRITICAL: expiry mailbox lag is 604719 [17:03:06] checking [17:03:21] Using_Gtid: No [17:03:32] yeah [17:03:38] I will leave it for later [17:05:10] so you are using the: --switch-sibling-to-child [17:05:11] ? [17:05:20] no, not now [17:05:27] just gtid gives me that [17:05:41] but that was what we used to use [17:05:43] ah right, so just change master to master_host bla [17:05:56] STOP SLAVE; SELECT sleep(1); CHANGE MASTER TO MASTER_HOST='db1068.eqiad.wmnet'; START SLAVE; [17:06:01] yep :) [17:06:11] the sleep 1 is not needed, but I wanted to add it [17:06:23] haha, I have the same issue when I do stop slave [17:06:27] I do it a few times sometimes [17:06:43] it is so that it doesn't error out [17:06:46] if it lags [17:06:48] or something [17:07:05] ok, so lets disable puppet on 69 and 64 [17:07:10] kill pt-heartbeat [17:07:16] and deploy? 
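The one-liner quoted above is the entire per-replica move: on MariaDB replicas already running with MASTER_USE_GTID=slave_pos, a CHANGE MASTER that names only the new host reuses the replica's GTID position, so no binlog coordinates are needed. A sketch of running it against one replica and verifying (the replica hostname is illustrative):

    REPLICA=db1081.eqiad.wmnet   # stand-in for any s4 replica already on GTID
    mysql -h "$REPLICA" -e "
        STOP SLAVE;
        SELECT sleep(1);   -- the cosmetic pause mentioned above
        CHANGE MASTER TO MASTER_HOST='db1068.eqiad.wmnet';
        START SLAVE;"
    mysql -h "$REPLICA" -e "SHOW SLAVE STATUS\G" \
        | egrep 'Master_Host|Using_Gtid|Seconds_Behind_Master'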
[17:07:19] ok, disabling puppet there [17:07:20] yep [17:07:27] (03PS1) 10Dzahn: nagios_common: add notification command for fundraising irc [puppet] - 10https://gerrit.wikimedia.org/r/349255 (https://phabricator.wikimedia.org/T163368) [17:07:36] db1069 disabled [17:07:45] we can stop replication on db1040 to be 100% sure [17:07:55] sure, we can afford it now :) [17:08:09] why db1064? [17:08:13] ah, sanitarium2 master [17:08:23] sorry [17:08:23] no [17:08:25] I meant [17:08:30] 68 [17:08:33] and 40 [17:08:36] sorry [17:08:44] old and new master [17:09:07] then I am running [17:09:13] root@db1068.eqiad.wmnet[(none)]> STOP SLAVE; SELECT sleep(1); CHANGE MASTER TO MASTER_HOST='db2019.eqiad.wmnet'; START SLAVE; [17:09:27] disabled on 40 and 68 [17:09:41] kill pt-heartbeat? [17:09:56] !log disabling puppet on serpens, seaborgium, pollux, dubnium, labservices1001, labservices1002 for tentative rollout of https://gerrit.wikimedia.org/r/#/c/348920/ [17:09:57] not a huge issue [17:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:08] it is not running on db1040, you killed it? [17:10:11] but when changing topology, I want to avoid circular stuff [17:10:23] it shouldn't be running because it is not a master on puppet [17:10:28] it may be running on 68 [17:10:47] no, because puppet was stopped [17:10:47] (03CR) 10Andrew Bogott: [C: 032] Fix configuration of size limits to allow paged LDAP search requests [puppet] - 10https://gerrit.wikimedia.org/r/348920 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [17:10:52] ok, then [17:10:53] so we are good on that regard [17:11:08] going to start merging (but not deploying) [17:11:20] (03PS2) 10Dzahn: nagios_common: add notification command for fundraising irc [puppet] - 10https://gerrit.wikimedia.org/r/349255 (https://phabricator.wikimedia.org/T163368) [17:11:38] so merge now? [17:11:51] everything is read only, so this is a bit redundant [17:11:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Promote db1068 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349249 (https://phabricator.wikimedia.org/T162133) (owner: 10Marostegui) [17:12:01] (03PS3) 10Dzahn: nagios_common: add notification command for fundraising irc [puppet] - 10https://gerrit.wikimedia.org/r/349255 (https://phabricator.wikimedia.org/T163368) [17:12:30] I am stopping the slave on db1040 [17:12:34] ok [17:12:54] (03Merged) 10jenkins-bot: db-eqiad.php: Promote db1068 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349249 (https://phabricator.wikimedia.org/T162133) (owner: 10Marostegui) [17:13:03] !log stopping replication on db1040 [17:13:06] (03CR) 10jenkins-bot: db-eqiad.php: Promote db1068 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349249 (https://phabricator.wikimedia.org/T162133) (owner: 10Marostegui) [17:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:11] jynus: going to deploy, ok? [17:13:27] yes [17:13:34] running scap [17:13:56] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3150511 (10Deskana) This seems stalled. Is there anyone in Ops or Discovery that can review this? 
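The "running scap" step is what ships the merged db-eqiad.php to the app servers. A rough sketch of that flow on the active deployment host (naos during the codfw switchover, per the discussion further down; the paths and log message are illustrative):

    cd /srv/mediawiki-staging
    git pull          # picks up the merged db-eqiad.php change
    scap sync-file wmf-config/db-eqiad.php \
        'db-eqiad.php: Promote db1068 to master (T162133)'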
[17:14:07] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3198533 (10Deskana) [17:14:22] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3150511 (10EBernhardson) >>! In T162037#3150531, @Gehel wrote: > I think that the transfer_to_es job i... [17:14:52] I can see it well on: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php&holaholahola [17:14:56] ups [17:15:02] I used the wrong string [17:15:15] but nobody has to know that, it just failed to connect [17:15:43] which string? [17:15:48] the server string [17:15:56] ah [17:15:56] io error- so nothing happened :-) [17:16:03] haha eqiad.wmnet [17:16:03] XD [17:16:13] moving now db1040 [17:16:14] not the first time nor the last it will happen, no [17:16:15] ok [17:17:12] uh, duplicate error [17:17:18] crap [17:17:29] PROBLEM - MariaDB Slave SQL: s4 on db1040 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 588993966 for key PRIMARY on query. Default database: commonswiki. [Query snipped] [17:18:05] 06Operations, 10fundraising-tech-ops, 13Patch-For-Review: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3198569 (10Dzahn) The way the custom IRC notifications work: - add a special notification command which writes to a new logfile (https://gerrit.wikimedia.org/r/34... [17:18:25] why? [17:18:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:18:55] it didn't reposition well [17:19:08] it started to replicate from 000001:1 [17:19:20] makes not much sense no [17:19:39] oh, I know why [17:19:45] the master doesn't have gtid [17:19:58] it is salvageable [17:20:05] which master? db2019? [17:20:11] db1040 [17:20:19] it doesn't have gtids enabled [17:20:20] aaah right, we always have it disabled [17:20:31] it is fixable [17:20:39] but let's focus on the other hosts first [17:21:16] let's run puppet everywhere [17:21:20] ok [17:21:25] running it on db1068 [17:21:28] with noop first [17:21:40] exec master pos is 702 which means only 1 or 2 events have been executed [17:21:55] we can revert those and reposition the old master in the right place [17:22:11] there is no reverse replication [17:22:14] so no problem [17:22:30] db1068 looks good, so I am going to fully run it [17:22:36] (puppet) [17:22:58] I have disabled the alert on db1040 [17:23:04] the other will complain but not page [17:23:26] puppet ran and pt-heartbeat is up too on db1068 [17:23:45] should we create the replication link now or later?
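The "it started to replicate from 000001:1" diagnosis falls straight out of SHOW SLAVE STATUS: a CHANGE MASTER without coordinates on a replica that is not using GTID silently resets the position to the very first binlog. A sketch of the fields worth reading on db1040 at this point, with the symptoms described above as comments:

    mysql -h db1040.eqiad.wmnet -e "SHOW SLAVE STATUS\G" | egrep \
        'Relay_Master_Log_File|Exec_Master_Log_Pos|Using_Gtid|Last_SQL_Errno'
    # Relay_Master_Log_File: db1068-bin.000001  <- restarted from the start
    # Exec_Master_Log_Pos:   702                <- only a couple of events ran
    # Last_SQL_Errno:        1062               <- stopped on the duplicate key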
[17:23:50] I would do it now [17:23:53] so we can forget about it [17:24:00] 1068 -> 2019 [17:24:07] yes [17:24:21] ok, running change master on db2019 [17:24:55] this is the only potential user breaking change [17:25:39] db1040 puppet enabled [17:25:54] labs must be lagging now [17:26:15] (03PS4) 10Dzahn: nagios_common: add notification command for fundraising irc [puppet] - 10https://gerrit.wikimedia.org/r/349255 (https://phabricator.wikimedia.org/T163368) [17:26:29] (03CR) 10Dzahn: [C: 032] nagios_common: add notification command for fundraising irc [puppet] - 10https://gerrit.wikimedia.org/r/349255 (https://phabricator.wikimedia.org/T163368) (owner: 10Dzahn) [17:27:19] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 871.42 seconds [17:27:23] there we go [17:27:29] i will silence it [17:27:35] it is ok [17:27:39] we are going to fix it soon [17:28:01] CHANGE MASTER TO MASTER_HOST='db1068.eqiad.wmnet', MASTER_USER='repl', [17:28:08] MASTER_LOG_FILE='db1068-bin.000001', MASTER_LOG_POS=713913759, MASTER_SSL=1; [17:28:14] let me double check [17:28:30] I am going to run almost that [17:28:45] go for it, looks good [17:28:50] db1068 has GTID [17:29:03] on db2019 [17:29:07] yep [17:29:08] go for it [17:29:42] looking good [17:29:45] looks ok, 0 lag [17:30:03] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3198646 (10Gehel) I should be the one reviewing this... [17:30:13] ok, let's check relay or binlog to fix db1040 [17:30:36] I wonder if it will be a mess becase no gtid? [17:30:40] for the slave [17:30:49] ah, the slaves are not on gtid [17:30:51] so no issue [17:30:58] it shouldn't be [17:30:59] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:31:49] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:32:34] there is only one statement [17:32:56] did you see it already? [17:33:02] i think i do [17:33:16] is it the globalimagelinks one? [17:33:29] I am trying to coorelate all 4 coordinates :-) still [17:33:59] i am only checking so far the db1040 error and the statements before that one [17:35:03] oh, remember semisync [17:35:17] that is why I did want to run puppet before [17:35:19] all that stuff [17:36:20] so [4-702) has been executed [17:36:33] on db1068-bin.000001 [17:37:03] yes [17:37:15] gtid is evil [17:37:35] becaues it doesn't warn you if you just run cHANGE master without params :-) [17:37:48] so from that binlog I only see the insert on globalimagelinks [17:37:50] happily, our writes are anti-idempotent [17:38:01] and right after the one that fails [17:38:34] (03PS5) 10Dzahn: nagios_common: add notification command for fundraising irc [puppet] - 10https://gerrit.wikimedia.org/r/349255 (https://phabricator.wikimedia.org/T163368) [17:38:36] (03PS1) 10Dzahn: nagios_common: add IRC notifications for #wikimedia-fundraising [puppet] - 10https://gerrit.wikimedia.org/r/349259 (https://phabricator.wikimedia.org/T163368) [17:39:24] agree [17:39:33] so a delete with no binlog [17:39:36] to not make it worse [17:39:43] hehe [17:39:57] and then we have to find the *right* coordinates [17:40:22] you know that when I ask you to double check my command is for something :-) [17:40:34] haha [17:41:15] does that table have a primary key? [17:41:20] let me guess, no? 
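With only positions 4 to 702 of db1068-bin.000001 wrongly executed, the offending events can be read back directly from the binlog. A sketch of that inspection (the datadir path is an assumption); under statement-based replication the output carries the session context, including the SET INSERT_ID discussed just below, ahead of the original statement:

    mysqlbinlog --start-position=4 --stop-position=702 \
        /srv/sqldata/db1068-bin.000001
    # illustrative shape of the output, not the real event:
    #   SET INSERT_ID=588993966/*!*/;
    #   INSERT /* SomeCaller::method */ INTO ... VALUES (NULL, ...);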
[17:41:33] it dooooes [17:41:44] (i thought the same) [17:41:49] ohhhhhhhhh [17:42:15] maybe you can write the delete and i search the coordinates the old-fashioned way? [17:42:16] but I might be reading the parameters wrong because I see the rc_id with NULL [17:42:26] autoincrement [17:42:33] no, I mean the insert that failed [17:42:35] NULL means use the autoinc [17:42:56] SET INSERT_ID= [17:43:00] see it above? [17:43:10] it is STATEMENT, do not expect much :-) [17:43:18] (03CR) 10Dzahn: [C: 032] nagios_common: add IRC notifications for #wikimedia-fundraising [puppet] - 10https://gerrit.wikimedia.org/r/349259 (https://phabricator.wikimedia.org/T163368) (owner: 10Dzahn) [17:43:28] ah [17:43:28] yes [17:43:45] so forget about the next one [17:43:50] write the delete [17:43:55] I will search the coords [17:43:57] ok [17:44:01] or the other way round [17:44:09] no worries, I will check the delete yes [17:45:44] I confirm all the above is right [17:45:52] on the slave's binlog [17:46:07] (I was checking it on the master's binlog before) [17:46:11] Is something up with our API due to the backend switch? It's been throwing "503 Backend fetch failed" errors a few times since the past 24 hours. [17:47:02] which backend and which api, you will have to be more specific? [17:47:21] jynus: I meant DC switch. And English wikipedia API. [17:47:33] action api or rest? [17:47:43] which query? [17:48:03] jynus: Query like "https://en.wikipedia.org/w/api.php?action=query&formatversion=2&format=json&titles=West%20Nile%20virus%20infection&prop=redirects&rdlimit=500" [17:48:32] WFM [17:48:34] It works intermittently. [17:48:44] can you write all details into a ticket? [17:48:49] it can be so many things! [17:49:05] Yeah. Okay, I'll look into filing a ticket. Thanks. [17:49:18] jynus: Do you know how many requests a second our API is capable of handling? [17:49:35] Overall. I'm not making a lot. [17:49:46] it depends [17:49:49] on the request [17:50:02] is https://phabricator.wikimedia.org/T163351 useful? [17:50:39] jynus: Doesn't seem like it. There's a ticket with the error we are seeing: https://phabricator.wikimedia.org/T163347 [17:51:18] I do not see that type of query failing too often [17:51:31] but there are more api experts [17:51:38] *Better [17:51:42] !log restarted icinga-wm (ircecho) to pick up config change [17:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:01] /win 56 [17:52:09] It's never happened in the past two months and a bunch of times in the last few hours, which killed our bot. :( [17:52:10] the ones I see flaky are api query contributions [17:53:01] I would add someone from platform for triage [17:53:29] and then it can bounce to us, but it doesn't seem any of the ongoing issues we know of [17:53:44] jynus: https://phabricator.wikimedia.org/P5301 [17:54:27] we are lucky, marostegui, last event replicated is heartbeat at 2017-04-20T17:12:43.001050 from codfw [17:54:36] which is super-easy to find [17:55:27] there is a problem, though [17:55:37] if we run it without replication, it will not fix its slaves [17:55:52] the delete you mean? [17:55:58] yeah [17:56:10] and that is too much work (e.g. delayed slave) [17:56:20] let me check that same select on the slaves [17:56:24] to see if it is there too [17:56:33] maybe with replication and if gtid breaks, it breaks? [17:57:24] well, it is there on db1069 and dbstore1002 [17:57:30] so it shouldn't really break no?
[17:57:43] no [17:57:53] I am just realistic :-) [17:57:55] xdd [17:58:04] I am checking that db1040 only has 3 hosts connected as slaves [17:58:09] run it and stop replication completely, so it doesnt start by accident [17:58:17] yes, that is on purpose [17:58:31] yeah, no, just checking if we are going to run it with log_bin=1 [17:58:33] worse case scenario, I had only broke 3 servers [17:58:37] making sure it doesn't get spread [17:58:42] to somewhere unexpected [17:58:47] lets run it with binlog on [17:58:59] ok, with binlog enabled [17:59:01] stop slave, and I will tell tell you the change master coords right now [17:59:12] I broke it, I take responsability of the fix [17:59:25] stop slave db1040 [17:59:27] and run the delete [17:59:29] right? [18:00:12] yes, in any order [18:00:19] ok [18:00:20] doing it now [18:00:32] done [18:00:34] that will not unbreak it, just revert the insert [18:00:45] yeah i know :) [18:01:25] I almost have the coords [18:01:30] great :) [18:01:33] I am doubting between 2 values [18:01:36] icinga-wm is me, it will be back [18:01:55] you want me to check them? [18:02:36] db1069 replicated fine by the way [18:02:52] i love statement based replication :p [18:03:32] db1068-bin.000001:695382174 [18:03:35] I think [18:03:54] let me see [18:04:11] 06Operations, 10Monitoring, 07LDAP, 13Patch-For-Review: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3198861 (10bd808) 05Open>03Resolved Both anon and authed result paging are now working thanks to @MoritzMuehlenhoff and https://gerrit.wikimedia.org/r/#/c/348920/ [18:04:43] 06Operations, 06Labs: Update documentation for Tools Proxy failover - https://phabricator.wikimedia.org/T163390#3198863 (10chasemp) a:05chasemp>03Andrew https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource%3ATools%2FAdmin&type=revision&diff=1757036&oldid=1756658 [18:04:49] last executed query is heartbeat on 2017-04-20T17:12:43.001050 [18:04:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:05:04] and that is where I found that on 68 [18:05:24] around 17:12:43 [18:05:44] 06Operations: ircecho - /etc/default/ircecho puppet issue - https://phabricator.wikimedia.org/T163476#3198865 (10Dzahn) [18:05:57] heartbeat is at 695381798 [18:06:27] next one is another heartbeat 695382174 [18:06:37] let me check with non-heartbeat statements [18:07:04] what was the other value you were doubting? [18:07:06] yes, the one after UPDATE /* Title::invalidateCache */ [18:07:16] I see 2 heartbeats from codfw [18:07:27] I was worried about why [18:07:39] but I realized that we had stopped replication [18:07:43] to make it easier [18:07:48] and it kinde did it [18:08:37] so I am 99% confident about db1068-bin.000001:695382174 [18:08:48] let's go for it then? [18:08:54] you do it or I do it? [18:09:26] ok, running reset slave all on db1040 [18:09:29] ok [18:10:16] and changing master again to db1068 [18:10:16] https://media.giphy.com/media/4KxeicCUTvhrW/giphy.gif [18:10:56] looks good! [18:11:11] ok, done [18:11:16] any complains from the slaves? [18:11:34] nope [18:11:38] at least not db1069 yet [18:12:04] w should move the slaves soon [18:12:12] as db1040 will explide anyway [18:12:17] at least dbstore1001 yes [18:12:17] XD [18:12:25] but that can be left for our next episode [18:12:57] repllag on labs [18:13:01] is it working? [18:13:08] db1069 is catching up yes [18:13:20] https://tools.wmflabs.org/replag/ [18:13:21] yay! [18:13:45] well done! 
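Pulled together, the repair just performed is three steps: revert the stray insert with the binlog left ON, so the statement also flows to db1040's own three replicas; wipe the bad replication state; and re-point db1040 at the coordinates found above. A hedged sketch; the DELETE's table and column are our inference from the rc_id/autoincrement discussion, since the actual query was snipped from the alert:

    mysql -h db1040.eqiad.wmnet -e "STOP SLAVE"
    # revert the duplicated row (table and column are an assumption)
    mysql -h db1040.eqiad.wmnet commonswiki \
        -e "DELETE FROM recentchanges WHERE rc_id = 588993966"
    mysql -h db1040.eqiad.wmnet -e "
        RESET SLAVE ALL;
        CHANGE MASTER TO MASTER_HOST='db1068.eqiad.wmnet', MASTER_USER='repl',
            MASTER_LOG_FILE='db1068-bin.000001', MASTER_LOG_POS=695382174,
            MASTER_SSL=1;
        START SLAVE;"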
:) [18:13:54] let me enable gtid [18:14:01] so it doesn't happen again :-) [18:14:05] haha [18:15:07] really nice gtid_io_pos chain [18:15:12] I just checked this: https://phabricator.wikimedia.org/T162681 [18:15:18] db1068 is affected, but only with the small downtime [18:15:23] (recable it) [18:15:34] well, we can handle that [18:15:36] yeah [18:15:41] not affected by the rack move [18:15:52] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3197165 (10Krinkle) @Gilles That's {T163408}, right? [18:15:52] (03CR) 10Rush: tools-proxy: Ensure kubelet is stopped on tools proxy nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349109 (https://phabricator.wikimedia.org/T163391) (owner: 10Madhuvishy) [18:15:56] (03PS2) 10Rush: tools-proxy: Ensure kubelet is stopped on tools proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/349109 (https://phabricator.wikimedia.org/T163391) (owner: 10Madhuvishy) [18:16:19] dbstore1002 should come back any time now [18:17:01] yeah the lag is almost gone on db1069 too [18:17:12] let's disable the notifications [18:17:18] for db1040 [18:17:35] becase I released enough for go into warning [18:17:40] but not to not alert again [18:18:41] ok [18:18:49] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:20:40] so we can resolve s4 failover? [18:20:57] because: [18:21:06] I updated the task saying it was done [18:21:49] wait T163110 [18:21:50] T163110: Reclone db1068 to become a slave in s4 - https://phabricator.wikimedia.org/T163110 [18:21:58] was that on s5 before? [18:22:07] I am lost [18:22:20] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 217.38 seconds [18:22:22] or was it for s5? [18:22:45] what is wrong there, the title or the text? [18:22:58] The text [18:23:00] I will fix it [18:23:08] ok, so we did well? [18:23:08] it was always on s4 [18:23:11] ok [18:23:12] yes we did :) [18:23:19] so that is the one that can be closed? [18:23:29] RECOVERY - Check Varnish expiry mailbox lag on cp2011 is OK: OK: expiry mailbox lag is 29432 [18:23:30] yes I have closed it, but will update the text [18:23:33] so it is not confusing [18:23:58] (03PS4) 10Jcrespo: prometheus-myqsld-exporter: Promote db1068 to the s4 master [puppet] - 10https://gerrit.wikimedia.org/r/349238 (https://phabricator.wikimedia.org/T163110) [18:24:11] (03PS1) 10Dzahn: ircecho: add missing "ensure => present" on config file [puppet] - 10https://gerrit.wikimedia.org/r/349267 (https://phabricator.wikimedia.org/T163476) [18:24:27] you are right: https://gerrit.wikimedia.org/r/#/c/338996/1/wmf-config/db-eqiad.php [18:24:58] fixed it so it is not confusing if we have to check it again in the future [18:25:10] assigned it to you as you did almost everything! [18:25:57] so 3 master are already on the right place? [18:25:57] (03CR) 10Dzahn: [C: 032] ircecho: add missing "ensure => present" on config file [puppet] - 10https://gerrit.wikimedia.org/r/349267 (https://phabricator.wikimedia.org/T163476) (owner: 10Dzahn) [18:26:00] only 4 left? [18:26:23] 4 switchovers left! [18:26:33] hopefully less eventful [18:26:45] hopefully! :) [18:27:01] I am going to get some food now and we can finish the other hosts tomorrow I guess? [18:27:04] I cannot submit- can you? 
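The "let me enable gtid" step above is a one-liner on MariaDB: once the replica has caught up, switch it from binlog coordinates to GTID positioning so the next CHANGE MASTER repositions automatically instead of falling back to the first binlog:

    mysql -h db1040.eqiad.wmnet -e "
        STOP SLAVE;
        CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
        START SLAVE;"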
https://gerrit.wikimedia.org/r/#/c/349238/ [18:27:13] let me see [18:27:27] (03PS5) 10Marostegui: prometheus-myqsld-exporter: Promote db1068 to the s4 master [puppet] - 10https://gerrit.wikimedia.org/r/349238 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [18:27:27] go away [18:27:33] sorry, you have to rebase again because i merged [18:27:37] (03CR) 10Jcrespo: [V: 032 C: 032] prometheus-myqsld-exporter: Promote db1068 to the s4 master [puppet] - 10https://gerrit.wikimedia.org/r/349238 (https://phabricator.wikimedia.org/T163110) (owner: 10Jcrespo) [18:27:38] sniped [18:27:43] I just rebased :) [18:27:49] it didn't say conflict [18:27:52] go [18:27:57] see you tomorrow [18:27:59] hahaha [18:28:08] ok ok I can see you just merged [18:28:10] :) [18:28:16] see you tomorrow (you should go too) [18:31:05] 06Operations, 10Monitoring, 06Performance-Team: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3199037 (10Peter) Isn't this the same as we see in Chrome something related to the switchover? {F7671306} IE is alerting since IE is so f... [18:38:14] (03PS3) 10Madhuvishy: tools-proxy: Ensure kubelet is stopped on tools proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/349109 (https://phabricator.wikimedia.org/T163391) [18:38:19] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:09] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.130 second response time [18:40:57] 06Operations, 10Monitoring, 06Performance-Team: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3199079 (10Gilles) I believe it's due to additional latency from these locations due to the switchover. The start of it coincides exactly... [18:41:04] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3199081 (10Gilles) Yep [18:43:46] (03Draft1) 10Paladox: ircecho: Add require => File['/etc/default/ircecho'], to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/349269 [18:43:49] (03PS2) 10Paladox: ircecho: Add require => File['/etc/default/ircecho'], to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/349269 [18:45:10] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3197165 (10Dzahn) @Gilles I had already ACKed that and created T163408. The issue is that it recovered about 3 hours ago and became CRIT again. the status chan... [18:46:00] 06Operations, 10Monitoring, 06Performance-Team: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3199098 (10Gilles) And the effect looks the same as when this last happened during the DDOS-related traffic rerouting. [18:46:37] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3199103 (10Dzahn) @Gilles could you please define who exactly is "we" in this request. Ideally a list of LDAP/wikitech user names. (what you see behind " Logged... 
[18:46:59] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.63 seconds [18:47:53] I want to deploy for https://phabricator.wikimedia.org/T163063 [18:47:59] in mediawiki [18:48:09] it seems Ops and greg-g are okay [18:48:46] checking db1047 [18:50:29] (03PS1) 10Ottomata: Add defaults in eventlogging service systemd service unit for statsd [puppet] - 10https://gerrit.wikimedia.org/r/349272 [18:52:21] (03CR) 10Ottomata: [C: 032] Add defaults in eventlogging service systemd service unit for statsd [puppet] - 10https://gerrit.wikimedia.org/r/349272 (owner: 10Ottomata) [18:52:26] (03CR) 10Ottomata: [C: 032] "No op in prod: https://puppet-compiler.wmflabs.org/6190/" [puppet] - 10https://gerrit.wikimedia.org/r/349272 (owner: 10Ottomata) [18:53:29] (03CR) 10Dzahn: [C: 032] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/349269 (owner: 10Paladox) [18:53:53] (03CR) 10Dzahn: [C: 032] "just affects tegmen (since today einsteinium isnt the live Icinga) http://puppet-compiler.wmflabs.org/6189/" [puppet] - 10https://gerrit.wikimedia.org/r/349269 (owner: 10Paladox) [18:54:21] (03PS3) 10Dzahn: ircecho: Add require => File['/etc/default/ircecho'], to base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/349269 (owner: 10Paladox) [18:55:18] s2 has just gone crazy [18:55:24] is there any maintenance ongoing? [18:56:05] lots and logs of rows being written [18:56:08] *lots [18:56:50] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3199125 (10Gilles) Thanks! The performance team, i.e. the following wikitech usernames: - Gilles - Krinkle - Aaron Schulz - Phedenskog [18:57:12] traffic almost tripled [18:57:30] i see nothing about maintenance [18:57:42] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=now-3h&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=s2&var-role=All [18:58:24] I see some exports, but those only create writes, not reads [18:58:59] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.64 seconds [18:59:29] ^that is just "too much stuff being written" [19:01:02] I think it is a ton of invalidations on ptwiki [19:01:37] (03CR) 10Krinkle: "+1 for consistency and use of explicit response instead of indirect logic via exception handler (which maybe doesn't work). E.g. use heade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349185 (owner: 10Hashar) [19:03:59] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 43.56 seconds [19:05:36] 06Operations, 06Labs: Initial OpenStack Neutron PoC deployment in Labtest - https://phabricator.wikimedia.org/T153099#3199148 (10chasemp) some key points I have taken ss of: {F7671560} {F7671561} {F7671562} {F7671563} [19:05:55] (03PS1) 10Jforrester: Enable mobile non-JavaScript editing on all MobileFrontend wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) [19:06:36] !log fixing duplicate ircecho situation - since today it should run from tegmen, the active icinga server [19:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:10] scap pull in mwdebug1002.eqiad.wmnet pulls from naos instead of mira. Is it intended? [19:07:21] Amir1: yes [19:07:21] (03CR) 10Jforrester: [C: 04-2] "Needs announcing, decision on release process." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/349274 (https://phabricator.wikimedia.org/T125174) (owner: 10Jforrester) [19:07:48] akosiaris: so the deployment node is naos? [19:07:55] I fetched stuff in mira :/ [19:08:44] Amir1: https://phabricator.wikimedia.org/T162859 and https://phabricator.wikimedia.org/T162900 [19:09:24] oh thanks [19:10:22] Amir1: and if you log in into mira you will get the respective message [19:10:31] it's quite fun to look at ;-) [19:10:57] :D Sorry, I probably missed it [19:11:08] it's huge !!!! [19:11:33] half my terminal screen is occupied by it [19:12:23] elukey: Whenever you, or joe, or one of the other people tweaking TMH are around. [19:12:59] RECOVERY - Check Varnish expiry mailbox lag on cp2022 is OK: OK: expiry mailbox lag is 0 [19:13:07] (and not busy) [19:17:43] CUSTOM - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms [19:20:14] 06Operations, 10fundraising-tech-ops, 13Patch-For-Review: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3199208 (10Dzahn) ``` 12:09 -!- icinga-wm [~icinga-wm@tegmen.wikimedia.org] has joined #wikimedia-fundraising 12:11 < icinga-wm> test for T163368 ``` The second lin... [19:20:23] mwdebug1002 is okay [19:21:06] I'm getting read-only errors in mwdebug but that's a different topic [19:24:43] !log start of ladsgroup@naos:/srv/mediawiki-staging/php-1.29.0-wmf.20$ scap sync-file php-1.29.0-wmf.20/extensions/ORES/includes/Hooks.php '[[gerrit:349271|Disable ORES in Recentchangeslinked]] (T163063)' [19:24:47] the "CUSTOM" icinga alert was a test for FR-specific alerts [19:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:52] T163063: SpecialRecentChangesLinked::doMainQuery bad query bringing down database server - https://phabricator.wikimedia.org/T163063 [19:24:52] (alnilam) [19:26:45] !log deploy finished [19:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:03] (I thought the logbot would do it) [19:28:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:36:09] that seemed like upload? [19:36:12] (03PS4) 10Rush: tools-proxy: Ensure kubelet is stopped on tools proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/349109 (https://phabricator.wikimedia.org/T163391) (owner: 10Madhuvishy) [19:36:17] (03CR) 10Rush: [C: 031] tools-proxy: Ensure kubelet is stopped on tools proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/349109 (https://phabricator.wikimedia.org/T163391) (owner: 10Madhuvishy) [19:36:58] brion: Oh, you are around… whenever you has a moment. 
[19:38:58] yes, our requests have increased from 10M/minute to 30 million/min [19:40:13] Amir1: Hmm, I wonder if that's cuz we're on naos/codfw why it isn't logging [19:40:41] the error rate is ok [19:40:45] yeah, This probably needs to be resolved but I guess not high priority [19:41:02] Amir1: i think so too (about the logbot normally doing that), sounds like something missing for naos vs mira [19:44:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:49:17] (03PS1) 10Ppchelko: Kafka: Enable topic deletion for the main kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/349280 (https://phabricator.wikimedia.org/T163392) [19:50:09] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:52:48] (03CR) 10Ppchelko: "Puppet compiler is happy: https://puppet-compiler.wmflabs.org/6191/" [puppet] - 10https://gerrit.wikimedia.org/r/349280 (https://phabricator.wikimedia.org/T163392) (owner: 10Ppchelko) [19:52:59] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 562 [19:53:09] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [19:53:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:56:00] mutante: Amir1 hrm, it definitely logged some stuff yesterday when I deployed. Unsure what changed. I know that it didn't work initially, but I did see godo.g restart tcpircbot and it worked after that. Unclear if any config was changed that puppet may have changed back. [19:58:35] What is "Varnish expiry mailbox lag" [19:59:50] https://varnish-cache.org/docs/trunk/reference/varnishstat.html doesn't say [20:00:48] (03CR) 10Madhuvishy: [C: 032] tools-proxy: Ensure kubelet is stopped on tools proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/349109 (https://phabricator.wikimedia.org/T163391) (owner: 10Madhuvishy) [20:05:48] 06Operations, 06Labs, 13Patch-For-Review: Ensure kubelet is stopped on Tools Proxy hosts - https://phabricator.wikimedia.org/T163391#3199496 (10madhuvishy) 05Open>03Resolved [20:07:03] Krinkle: I think we kinda made up that term [20:07:33] Krinkle: It’s obviously lag due the the expire of a mialbox in Varnish. :P [20:07:37] *to [20:07:54] Wow, typos in my smartass comment. [20:08:05] it has to do with the removal (to free space, etc) of expired objects (including, I think, also "expired" prematurely to push out due to LRU) [20:08:15] there's a separate thread of execution which does the actual free-ing [20:08:43] the many threads handling normal traffic send notifications through a mailbox (one-way memory queue with locks, if you will) to inform it of objects it should expire [20:09:27] and "Varnish expiry mailbox lag" stat we're tracking there is basically "how many notifications of objects that need expiring/freeing are backlogged in that mailbox because the expiry thread is falling behind on processing them" [20:10:15] when it gets up in the millions, it can become the case that the majority of all the objects in the cache are referenced by those mailbox messages and awaiting expiry [20:10:36] hey uh...if i have a question about issues is here a good place to ask? 
[20:10:36] so long as references to them are backlogged there, they're considered busy/locked and can't be freed [20:10:57] so eventually that wreaks havoc [20:11:49] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 294196 [20:12:43] the alerting thresholds are imperfect (not just tuning them, they might really just not be the thing we should be alerting on) [20:13:05] and CRITs on those don't necessarily always translate to a real problem, they can self-recover from heavy mailbox lag without impacting users sometimes [20:13:32] it's an evolving situation I guess :) [20:13:50] Chrissymad: depends on the nature of the issue [20:14:06] Chrissymad: Depends on what kind of issue. We can't really help you if you lost your cat. [20:14:45] honestly it clearly wasn't important enough because i already forgot what it was between the time i asked that question and now...but it was related to wikipedia :P [20:15:06] in that case the only sensible answer to your non-question is 42 [20:15:39] * RainbowSprinkles writes Special:FindMyCat just in case [20:15:46] * Chrissymad has a dog [20:15:50] * Chrissymad finds findmycat useless [20:15:54] :P [20:15:59] Chrissymad: Dogs are better anyway ;-) [20:16:03] RainbowSprinkles: I agree! [20:16:08] my pupper just got out of surgery [20:16:14] and i cannot wait to pick him up [20:16:20] Awww, hope for a speedy recovery! [20:16:56] it wasn't major thankfully, he had a tumor that turned out to be 4 tumors but the vet is pretty confident its just a hystiocytoma [20:17:05] 06Operations, 10Traffic: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3199538 (10BBlack) [20:18:37] https://usercontent.irccloud-cdn.com/file/qhRFqVBi/ [20:18:41] doggo for reference :D [20:19:22] * DatGuy pets [20:19:37] also known as dingus-doggus [20:19:52] an exemplary species of dogtard [20:22:39] (03CR) 10Smalyshev: [C: 031] wdqs - monitor response times for both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/349241 (owner: 10Gehel) [20:25:20] (03PS1) 10Andrew Bogott: Dynamicproxy: Set up a GET-only frontend [puppet] - 10https://gerrit.wikimedia.org/r/349287 (https://phabricator.wikimedia.org/T115752) [20:27:42] jouncebot next [20:27:42] In 84 hour(s) and 32 minute(s): Wiktionary InterwikiSorting & Cognate deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170424T0900) [20:27:52] (03PS2) 10Andrew Bogott: Dynamicproxy: Set up a GET-only frontend [puppet] - 10https://gerrit.wikimedia.org/r/349287 (https://phabricator.wikimedia.org/T115752) [20:27:58] is that a new deployment window? 
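The mailbox described above maps onto two varnishstat counters: worker threads "mail" expiry notices (exp_mailed) and the expiry thread acknowledges them (exp_received). The alerted lag is presumably just their difference; a sketch of checking it by hand on a cache node (counter names as in Varnish 4, the arithmetic is our assumption):

    varnishstat -1 | egrep 'MAIN\.exp_(mailed|received)'
    # lag ~= MAIN.exp_mailed - MAIN.exp_received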
[20:28:24] we're in a deployment freeze this week [20:28:33] (due to datacenter-switchover stuff) [20:29:32] bblack: yes i know that but ive never seen the wikitonary deployment window before [20:32:27] oh I thought you were referring to the crazy "84 hours" :) [20:32:29] PROBLEM - Check Varnish expiry mailbox lag on cp2011 is CRITICAL: CRITICAL: expiry mailbox lag is 605428 [20:32:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:33:46] (03CR) 10BryanDavis: [C: 031] Dynamicproxy: Set up a GET-only frontend [puppet] - 10https://gerrit.wikimedia.org/r/349287 (https://phabricator.wikimedia.org/T115752) (owner: 10Andrew Bogott) [20:34:49] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:35:09] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:36:25] (03CR) 10Andrew Bogott: [C: 032] Dynamicproxy: Set up a GET-only frontend [puppet] - 10https://gerrit.wikimedia.org/r/349287 (https://phabricator.wikimedia.org/T115752) (owner: 10Andrew Bogott) [20:41:47] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3199656 (10aaron) I think it's worth trying persistent connections again. If there is no intra-DC replicati... [20:45:50] (03PS1) 10Andrew Bogott: Dynamicproxy: Syntax fixes to proxygetter nginx config [puppet] - 10https://gerrit.wikimedia.org/r/349291 [20:47:31] (03CR) 10Andrew Bogott: [C: 032] Dynamicproxy: Syntax fixes to proxygetter nginx config [puppet] - 10https://gerrit.wikimedia.org/r/349291 (owner: 10Andrew Bogott) [20:50:40] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 11595.65 ms [20:50:41] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 53%, RTA = 11097.88 ms [20:50:49] RECOVERY - Host saiph is UP: PING WARNING - Packet loss = 0%, RTA = 1897.40 ms [20:55:19] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [21:08:26] For the records, Phabricator's been down for me for a short while (Can Not Connect to MySQL). WFM again now. [21:10:58] i can confirm too [21:11:08] back up now. [21:11:13] probaly intermittent? [21:11:57] errm, yeah, still an intermittent problem. [21:11:59] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 714261 [21:12:08] oh [21:12:13] I get a cluster expection now [21:12:16] thats new [21:12:22] Unable to establish a connection to any database host (while trying "phabricator_user"). All masters and replicas are completely unreachable. [21:12:25] jynus ^^ [21:12:36] any ops? [21:15:35] twentyafterfour: ^ [21:15:35] the bot stopped working [21:15:52] Also pinging mutante (sorry but you're on ops duty) [21:15:52] Filled task https://phabricator.wikimedia.org/T163507 [21:15:57] RainbowSprinkles: ^ [21:16:43] ... [21:16:54] wfm right now [21:17:05] yeah, it's intermittent [21:17:29] probably some "andre shouldn't work late" implementation I'm not aware of ;) [21:20:00] there's been a few DB related (like, software not behaving with their db interactions) issues since the dc switch [21:22:32] Do we need to wake up DBAs or is someone in rel-eng and phab experience troubleshooting? [21:22:46] * robh just wants to know if he needs to start pinging people ;] [21:22:54] twentyafterfour: any ideas thus far? 
[21:23:14] I don't want to sit here like I'm not willing to help, just not sure how I can assist =] [21:23:43] what's up? intermittent phab db issues? [21:23:46] I haven't seen it myself [21:23:47] I think this is releng gets the ball first and then we see [21:23:58] I got emails from whatever monitoring service [21:24:08] apergos: yeah, agreed [21:24:23] I just got an error myself [21:24:43] paravoid: same as what paladox pasted? [21:24:51] Unable to connect to master database ("phabricator_user"). This is a severe failure; your request did not complete. [21:25:01] greg-g its showing all kinds of errors [21:25:05] yeah [21:25:07] weird. I don't see anything too strange in ganglia [21:25:22] all related to not being able to connect to the db [21:25:31] twentyafterfour: with the dbs? [21:25:34] 06Operations, 10fundraising-tech-ops, 13Patch-For-Review: Revisit paging strategy for frack servers - https://phabricator.wikimedia.org/T163368#3199856 (10Jgreen) I removed 'admins' from contact_groups for frack hosts, so we should stop seeing alerts re. frack hosts in #wikimedia-operations, so frack host al... [21:25:49] greg-g: yeah, still digging [21:26:15] twentyafterfour this is the error i see https://phabricator.wikimedia.org/F7673574 [21:26:25] try uploading an image that got me many errors [21:27:13] yeah, we don't need any more error msg reports, they are all some variation of unable to connect to the dbs [21:30:06] I think the problem is phab makes a lot of connections and heavy load from a bot will exceed the maximum connection limit which is set in mysql [21:30:26] why do you say so twentyafterfour? [21:30:28] it happens periodically when a spider disregards robots.txt [21:30:42] paravoid: because I see a ton of hits from the same IP in access log [21:31:06] and because the normal load on phab already is close to exceeding connection limits [21:31:38] !log stopped phd on iridium to reduce load on the database [21:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:57] the ip is 90.231.10.86 [21:33:20] which is telia in sweden, supposedly [21:33:56] agent is: Mozilla/5.0 (compatible; mbot/1.8; cust0002; +https://www.teorem.se/bot.html) [21:34:24] I don't really see a causative link so far? [21:34:29] do you have more evidence to support this theory? [21:34:49] paravoid: I'm not sure yet, it's the best I've got so far [21:35:18] https://www.teorem.se/bot.html link results in it asking for me to continue as safari is unsure if it is safe. 
[21:35:21] there is a spike in traffic and ~90% of the recent access log entries are from that IP [21:35:35] paladox: it just leads to a 404 anyway [21:35:37] there is nothing there [21:35:41] oh [21:35:54] email Abuse@telia.com [21:36:05] I really don't see how clicking on the link in the UA of a bot will help fix this [21:36:06] fwiw a similar thing happened a few days ago and I blocked an IP [21:36:07] http://whatmyip.co/view/ip_addresses/1525090816/90.231.10.0_90.231.10.255 [21:36:25] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=9&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1043 [21:36:28] is the client connection graph [21:36:39] https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=iridium.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS [21:36:51] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1043 is db1043's graphs [21:36:57] ^ that shows the spike in the past hour [21:37:31] definitely a spike, but looks like it's more so expensive queries than client connections [21:38:14] where are phabricator's logs? [21:38:34] paravoid: on iridium in /var/log/apache2/phabricator_access.log and phabricator_error.log [21:38:41] application logs I meant [21:39:00] that is the application's logs [21:39:05] it outputs via apache [21:39:17] there are login logs in phab [21:40:56] not a log of connections right now [21:41:05] (for the records, user login logs in Phab are at https://phabricator.wikimedia.org/people/logs/ ) [21:41:20] I can see a log of mysql connections via the cli: [21:41:23] fwiw the graph also shows a significant spike in connections [21:41:29] /srv/phab/phabricator/bin/storage shell [21:41:30] but errors stopped at 00:35 [21:41:50] andre__: thanks I was looking for that url [21:41:55] madhuvishy: yeah, but from 25-50 to 250 [21:42:06] yup [21:42:56] I still think it's that crawler hitting every page... it doesn't take much to DOS phabricator [21:43:01] every page view is very expensive [21:43:27] like I could easily DOS it with a 1 megabit connection and a P3 CPU [21:44:25] it really sounds like the database stopped responding though, that's a different problem [21:44:46] and it wasn't a max connections issue either I think [21:45:05] hmm currently there aren't that many connections open to mysql (at least not from what I can see, but I don't have a privileged user) [21:45:16] there aren't no [21:45:54] ahh, the access log isn't moving too fast either [21:46:13] the spike ended :-/ [21:47:31] maxconn 5000 [21:47:34] !log started phd on iridium [21:47:38] is haproxy's maxconn [21:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:13] paravoid: each web request makes 20+ database connections. it's kind of gross. 
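The "ton of hits from the same IP" check above is the classic top-talkers pass over the access log. A sketch, assuming the client IP is the first field of phabricator_access.log (that depends on the configured LogFormat):

    awk '{print $1}' /var/log/apache2/phabricator_access.log \
        | sort | uniq -c | sort -rn | head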
[21:48:21] that's simultaneous though [21:48:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:48:54] we've exceeded it a few times before just from yahoo and bing hitting us at the same time [21:49:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:49:20] root@dbproxy1003:/etc/haproxy# netstat -nap |grep 10.64.0.198:3306 |grep -c ESTAB [21:49:23] 100 [21:49:41] 06Operations, 10DBA, 10Phabricator: Intermitten outage on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3199906 (10Aklapper) p:05Unbreak!>03High This is intermittent so I don't see why this should be Unbreak Now [21:49:51] yeah it's barely anything now :-/ [21:50:02] 06Operations, 10DBA, 10Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3199908 (10Aklapper) [21:50:04] ok, ping me if it happens again [21:50:21] fwiw, for anyone reading [21:50:36] it's cache_misc esams -> cache_misc codfw -> cache_misc eqiad -> iridium -> dbproxy1003 -> db1043 [21:50:58] paravoid: thanks! [21:51:00] twentyafterfour i wonder should a task be created to try and improve phabricators connections to try and prevent ddos from happening. [21:51:10] as it took me a while to figure out where to look [21:51:11] I'll keep an eye on it [21:51:16] RECOVERY - Host cp3038 is UP: PING OK - Packet loss = 0%, RTA = 119.54 ms [21:51:18] on an entirely separate matter [21:51:27] I've been getting *a lot* of CSRF errors lately [21:51:31] from phabricator [21:51:35] * twentyafterfour adds that to wikitech docs somewhere [21:51:36] PROBLEM - traffic-pool service on cp3038 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is activating [21:51:36] PROBLEM - Check systemd state on cp3038 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. [21:51:36] PROBLEM - Freshness of OCSP Stapling files on cp3038 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2016-ecdsa-unified.ocsp is more than 18300 secs old! [21:51:36] PROBLEM - Freshness of zerofetch successful run file on cp3038 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more than 86400 secs old! [21:51:36] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet last ran 7 days ago [21:51:56] paravoid: like invalid token errors? [21:51:59] also login prompts that I can just cancel and everything works, probably related [21:52:02] yeah [21:52:08] that doesn't sound right [21:52:10] it either has to do with multiple tabs, or opening file attachments, or both [21:52:30] hmmm... well file attachments do some funky stuff with csrf tokens [21:52:36] RECOVERY - traffic-pool service on cp3038 is OK: OK - traffic-pool is active [21:52:36] RECOVERY - Check systemd state on cp3038 is OK: OK - running: The system is fully operational [21:52:36] RECOVERY - Freshness of zerofetch successful run file on cp3038 is OK: OK [21:52:36] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [21:53:02] in order to provide security validation it has to pass tokens across two domains [21:53:02] ignore that cp3038, it's esams' remote hands fixing things [21:53:24] robh: have you noticed that too? 
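Given the path iridium -> dbproxy1003 (haproxy, maxconn 5000) -> db1043 spelled out above, the exhausted-connections theory can be tested against haproxy's own counters. A sketch that assumes the stats socket is enabled at a conventional path:

    echo 'show info' | socat stdio /run/haproxy/admin.sock \
        | egrep 'CurrConns|Hard_maxconn|Maxconn'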
[21:53:26] PROBLEM - HTTPS Unified RSA on cp3038 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -59299 seconds left [21:53:26] PROBLEM - HTTPS Unified ECDSA on cp3038 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -52519 seconds left [21:53:46] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [21:53:46] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [21:53:46] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [21:53:56] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [21:53:56] RECOVERY - Host cp3045 is UP: PING OK - Packet loss = 0%, RTA = 119.64 ms [21:53:56] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [21:53:56] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [21:53:56] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [21:54:06] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [21:54:06] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [21:54:16] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [21:54:16] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [21:54:16] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [21:54:16] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [21:54:16] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [21:54:17] paravoid: asking about phab errors? [21:54:25] yes [21:54:26] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [21:54:26] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [21:54:26] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [21:54:26] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [21:54:32] ive not seen a single one on all my reloads [21:54:36] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [21:54:36] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [21:54:36] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [21:55:00] 06Operations, 10hardware-requests, 15User-fgiunchedi: Additional ram quote for Prometheus baremetal - https://phabricator.wikimedia.org/T161606#3199930 (10RobH) 05Open>03Resolved [21:55:02] phabricator loves me best. [21:55:20] ok, reproduced this just now [21:55:38] the login form, not the CSRF (not sure about the CSRF as it requires form submission) [21:55:45] twentyafterfour: do you have access to e.g. https://phabricator.wikimedia.org/T161723 ? [21:55:50] you're phab admin so I guess you would [21:56:03] paravoid: I do have access to that, but phab admins don't have access to everything [21:56:13] so click on the first attachment's "Download" link [21:56:18] doesn't matter if you actually download it or not [21:56:32] then click on the attachment's heading, not the download link [21:56:36] PROBLEM - Freshness of zerofetch successful run file on cp3045 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more than 86400 secs old! [21:56:46] PROBLEM - Freshness of OCSP Stapling files on cp3045 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2016-rsa-unified.ocsp is more than 18300 secs old! [21:56:46] PROBLEM - HTTPS Unified ECDSA on cp3045 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -52718 seconds left [21:56:46] PROBLEM - HTTPS Unified RSA on cp3045 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -37958 seconds left [21:56:49] it gives me a normal download prompt...? 
[21:57:03] works for me [21:57:11] 06Operations, 10ops-esams, 06DC-Ops: Broken IPMI/drac on cp3038 and cp3045 - https://phabricator.wikimedia.org/T157537#3199932 (10faidon) Both came up just a few minutes ago :) [21:57:22] when you say attachment heading, you mean the file # in top left, or just on the filename in the middle of the screen? [21:57:43] both work normally for me, just not sure which you were asking us to try [21:57:54] yeah works normally for me as well, in firefox [21:58:00] indeed, ive been doing in ff as well [21:58:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:58:21] * twentyafterfour tries chrome [21:58:34] no I use firefox [21:58:40] let me try disabling privacy badget again.. [21:58:42] badger* [21:58:54] ive whitelisted phab's url from all my plugin interactions [21:58:58] well, all of wikimedias domains, heh [21:58:59] yeah works now.. [21:59:00] damn [21:59:21] ok, I'm on a coffee shop wifi, but is anyone else having issues loading wikitech? phab loads but not wikitech [21:59:33] wfm [21:59:38] dangit [21:59:39] works for me [21:59:47] wfm [22:00:06] twentyafterfour: seems it's privacy badger again, sorry! [22:00:18] woooo esams systems recovery [22:00:28] sorry, im on a 5 minute time delay on non procurment things! [22:00:34] paravoid: no problem, definitely worth looking into these things, CSRF token issues could be worrying [22:00:36] RECOVERY - Freshness of zerofetch successful run file on cp3045 is OK: OK [22:00:52] well I don't know if it fixed the CSRF issue, but it probably is related [22:01:26] RECOVERY - HTTPS Unified RSA on cp3038 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 588157 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-01-03 12:00:00 +0000 (expires in 257 days) [22:01:27] RECOVERY - HTTPS Unified ECDSA on cp3038 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 594997 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-01-03 12:00:00 +0000 (expires in 257 days) [22:01:46] RECOVERY - Freshness of OCSP Stapling files on cp3038 is OK: OK [22:02:14] weird I'm getting this on wikitech trying to edit with visual editor: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500. Would you like to retry?" [22:03:05] happends to me to [22:03:06] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:04:33] chasemp andrewbogott ^^ (visualeditor not working on wikitech) [22:06:19] (03PS1) 10Andrew Bogott: novaproxy: Open up port 5669 for quering the proxy API. [puppet] - 10https://gerrit.wikimedia.org/r/349339 [22:07:06] paladox: why do you ping these people because of that? [22:07:13] there is a bug report. [22:07:21] andre__ i thought labs maintains wikitech [22:07:22] and it's not that urgent? people can still edit. [22:07:45] (03CR) 10Andrew Bogott: [C: 032] novaproxy: Open up port 5669 for quering the proxy API. 
[22:07:49] (03PS5) 10Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458)
[22:08:13] apparently it's been busted since dc switch
[22:08:17] it's https://phabricator.wikimedia.org/T163438
[22:08:31] 06Operations, 06Performance-Team: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3199960 (10Dzahn) @Gilles alright. thanks. so first step is we need to create Icinga users ("contacts") for you. Krinkle has one but not the other 3 of you. An...
[22:08:34] it's also not worth working on honestly
[22:08:56] it will be "fixed" in a week
[22:09:21] Also some weird stuff from Phab. "Unable to Reach Any Database Unable to establish a connection to any database host (while trying "phabricator_draft"). All masters and replicas are completely unreachable."
[22:09:35] Worked on the second try.
[22:09:59] Niharika: yeah some intermittent problems there, seems mostly ok now. Releng folks are monitoring
[22:10:01] hmmm
[22:10:03] Niharika, yeah, we discussed that a while ago here
[22:10:09] intermittent :-/
[22:10:17] Okay. :)
[22:10:31] our really fast spider is back
[22:10:41] I'm gonna block the IP
[22:10:50] (03PS1) 10Andrew Bogott: novaproxy: Rename ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/349341
[22:12:06] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:12:21] (03CR) 10Andrew Bogott: [C: 032] novaproxy: Rename ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/349341 (owner: 10Andrew Bogott)
[22:12:27] RECOVERY - Check Varnish expiry mailbox lag on cp2011 is OK: OK: expiry mailbox lag is 0
[22:12:56] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:14:14] (03PS1) 1020after4: Add 90.231.10.86 to phabbanlist, this crawler is causing outages. [puppet] - 10https://gerrit.wikimedia.org/r/349342
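(For illustration: change 349342 adds the crawler's IP to the phabricator module's ban list, an application-level block. A network-level equivalent could be written with the same repo's ferm rules, as the novaproxy changes above use; the resource title and rule body below are made up for the sketch and are not the actual 349342 diff.)

    # Hypothetical network-level alternative to the phabbanlist change.
    # Assumes the puppet repo's ferm::rule define with a raw ferm 'rule'
    # parameter; the title and rule text are illustrative only.
    ferm::rule { 'drop-aggressive-crawler':
        rule => 'saddr 90.231.10.86 proto tcp dport (http https) DROP;',
    }

(A drop at the firewall would stop the load entirely, at the cost of losing the application's ability to serve the bot a polite error page.)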
[22:14:27] is it still happening?
[22:14:54] paravoid: yeah and the spike in load corresponds with a bunch of new requests from the same crawler
[22:15:02] no I mean right now
[22:15:14] or did you block it already?
[22:15:39] paravoid: see https://gerrit.wikimedia.org/r/349342
[22:15:57] it's still happening
[22:16:02] rapid fire requests from that IP
[22:16:05] where do you see that?
[22:16:10] no the database errors I mean
[22:16:13] on iridium in the access log
[22:16:25] [Thu Apr 20 22:11:03.517163 2017] [:error] [pid 20752] [client 90.231.10.86:20403] [2017-04-20 22:11:03] PHLOG: 'Retrying (2) after AphrontConnectionQueryException: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99).' at [/srv/deployment/phabricator/deployment-cache/revs/7dd45143c333b8fb854b8f40bd96c46ea56
[22:16:27] the database errors: only just the one report from Niharika above
[22:16:32] a0970/libphutil/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:108]
[22:17:56] paravoid: I'm relatively sure it's DoS from that IP which is causing the database errors, I've seen the exact same problem before and it was from this same bot
[22:17:56] !log setting tw_reuse to 1 on dbproxy1003
[22:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:04] yeah but I did ^^^
[22:18:16] I'm trying to fix the root cause, not the symptom :)
[22:18:28] (03PS1) 10Ejegg: Ensure symlink to enable mcrypt for CLI [puppet] - 10https://gerrit.wikimedia.org/r/349343
[22:18:32] hmm
[22:18:43] and it didn't happen again fwiw
[22:18:49] I thought the root cause was just that phabricator is not very efficient with its connections
[22:18:59] makes too many of them
[22:19:32] paravoid: so tw_reuse seems to have helped?
[22:19:59] either that or the request rate dropped a little bit
[22:20:07] let's keep it like that for a bit, I'm around
[22:20:20] it does seem to have slowed down a bit but the crawler is still going at it
[22:20:27] (it certainly doesn't hurt)
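(Context for the tw_reuse change: the "(99)" in the MySQL error above is the OS errno, EADDRNOTAVAIL on Linux, which is consistent with the proxy exhausting its ephemeral ports on connections stuck in TIME_WAIT during the request flood; net.ipv4.tcp_tw_reuse=1 lets the kernel reuse those sockets for new outbound connections. A minimal sketch of persisting the hand-applied setting, assuming a plain file-plus-exec rather than whatever sysctl wrapper the production puppet repo actually provides:)

    # Persist the tw_reuse setting applied by hand on dbproxy1003.
    # Drop-in file name and resource titles are assumptions for the sketch.
    file { '/etc/sysctl.d/60-tcp-tw-reuse.conf':
        ensure  => present,
        content => "net.ipv4.tcp_tw_reuse = 1\n",
        notify  => Exec['apply-tcp-tw-reuse'],
    }
    exec { 'apply-tcp-tw-reuse':
        # Re-read only this drop-in when its content changes.
        command     => '/sbin/sysctl -p /etc/sysctl.d/60-tcp-tw-reuse.conf',
        refreshonly => true,
    }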
[22:20:56] anyone have a second to review https://gerrit.wikimedia.org/r/349343 ? Just adding a symlink to /etc/php5/cli/conf.d on the integration boxes
[22:21:30] I wouldn't merge that until the FIXME question gets answered
[22:21:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[22:22:49] bd808: re: T163438, I wouldn't call it "fixed" by getting back to eqiad, even in quotes
[22:22:50] T163438: VisualEditor broken on wikitech: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." - https://phabricator.wikimedia.org/T163438
[22:22:54] paravoid: ah, thcipriani just tracked that down, there's no postinst file making the link in the mcrypt trusty package, though there is in the packages for the other extensions
[22:22:56] bd808: it's a real bug and should be fixed
[22:23:09] will update the comment
[22:23:13] but not in the jessie package?
[22:23:41] jessie package looks fine
[22:23:51] I think it's there in the jessie package for mcrypt, too. just ubuntu doesn't turn it on
[22:23:57] then put it in an os_version conditional
[22:24:00] turn what on?
[22:24:10] the trusty package from http://packages.ubuntu.com/trusty/amd64/php5-mcrypt/download is missing https://gist.github.com/thcipriani/731c66ca1c976dfd2d6244a84e01e190
[22:24:33] well "missing" vs the php5-curl package for trusty anyway
[22:24:56] the mcrypt extension
[22:26:39] "turn it on" == symlink an ini file showing php where to find the mcrypt.so file so that it knows that the php-mcrypt extension is installed.
[22:27:10] it should be php5enmod I think?
[22:27:15] not manually creating a symlink
[22:27:21] yeah
[22:27:52] modules/phragile/manifests/init.pp: command => '/usr/sbin/php5enmod mcrypt',
[22:27:55] modules/toollabs/manifests/exec_environ.pp: command => '/usr/sbin/php5enmod mcrypt',
[22:28:25] sigh at everyone repeating this instead of creating the right abstraction
[22:28:30] but ok, feel free to do either here
[22:28:49] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:28:50] heh the abstraction exists even, it's just under the mediawiki module
[22:28:53] modules/mediawiki/manifests/php_enmod.pp:
[22:28:59] !log enable rate limiting in phabricator
[22:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:07] ah, gotcha
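(For illustration: the grep output above shows the same php5enmod exec cut-and-pasted into phragile and toollabs, and the existing abstraction lives under the mediawiki module where other modules can't cleanly reuse it. A module-agnostic version in the spirit of modules/mediawiki/manifests/php_enmod.pp might look like the sketch below; the define name, the '20-' conf.d priority prefix, and the os_version guard are assumptions, not the actual repo code.)

    # Sketch of a generic php5enmod wrapper.
    define php_enmod () {
        exec { "php5enmod-${title}":
            command => "/usr/sbin/php5enmod ${title}",
            # php5enmod symlinks an ini file into each SAPI's conf.d;
            # checking the CLI one makes the exec idempotent. The '20-'
            # priority prefix is the usual default but is an assumption.
            creates => "/etc/php5/cli/conf.d/20-${title}.ini",
        }
    }

    # Usage for the mcrypt change under review, guarded for trusty only
    # since the jessie package enables the extension itself (os_version()
    # is the repo's own helper function):
    if os_version('ubuntu == trusty') {
        php_enmod { 'mcrypt': }
    }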
[22:29:57] twentyafterfour: just a suggestion, maybe make sure the robots.txt config is up-to-date
[22:31:13] (03PS2) 10Ejegg: Ensure mcrypt enabled on integration slaved [puppet] - 10https://gerrit.wikimedia.org/r/349343
[22:31:40] ok, ^^^ uses php5enmod
[22:31:41] Zppix: this crawler isn't honoring the robots.txt
[22:32:05] twentyafterfour: what crawler is it *if* you're able to tell me
[22:32:46] Zppix: read backscroll, it's there
[22:33:07] please, before distracting with basic questions
[22:33:26] oh sorry, I didn't see that when I was looking through, my bad
[22:33:47] (03PS3) 10Ejegg: Ensure mcrypt enabled on integration slaves [puppet] - 10https://gerrit.wikimedia.org/r/349343
[22:36:55] paravoid: oh, I didn't see your 'abstraction' comment till just now, looking at how to use that
[22:37:05] no you can't
[22:37:14] it's a mediawiki one, it really needs to be moved out of there first
[22:37:28] what we really need is a "php" puppet module...
[22:37:32] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3200051 (10AlexRus) >>! In T162035#3197947, @ema wrote: > The user-faci...
[22:37:39] RECOVERY - HTTPS Unified ECDSA on cp3045 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 592831 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-01-03 12:00:00 +0000 (expires in 257 days)
[22:37:39] RECOVERY - HTTPS Unified RSA on cp3045 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 585983 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2018-01-03 12:00:00 +0000 (expires in 257 days)
[22:37:44] ah, cool, I'll leave the cut-n-pasted one in https://gerrit.wikimedia.org/r/349343 then
[22:37:44] but yeah, that's larger than your change, won't ask you to do that :)
[22:37:59] RECOVERY - Freshness of OCSP Stapling files on cp3045 is OK: OK
[22:38:29] heh, you probably don't want me trying to make more than trivial changes in puppet
[22:39:00] noted :P
[22:41:18] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3200057 (10matmarex) @alexrus Can you give an example of a file that is...
[22:44:16] (03PS4) 10Ejegg: Ensure mcrypt enabled on integration slaves [puppet] - 10https://gerrit.wikimedia.org/r/349343
[22:44:25] k, rebased and merge-able ^^^
[22:48:23] (03PS3) 1020after4: Phab: create some task types and corresponding custom fields. [puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499)
[22:48:36] (03CR) 1020after4: [C: 031] Phab: create some task types and corresponding custom fields. [puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499) (owner: 1020after4)
[22:48:46] (03PS4) 1020after4: Phab: create some task types and corresponding custom fields. [puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499)
[22:49:51] (03CR) 1020after4: [C: 031] keyholder: create /run/keyholder at boot [puppet] - 10https://gerrit.wikimedia.org/r/348760 (owner: 10Filippo Giunchedi)
[22:49:56] (03CR) 10Thcipriani: Ensure mcrypt enabled on integration slaves (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349343 (owner: 10Ejegg)
[22:52:00] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3200084 (10AlexRus) >>! In T162035#3200057, @matmarex wrote: > @alexrus...
[22:53:28] (03PS5) 10Ejegg: Ensure mcrypt enabled on integration slaves [puppet] - 10https://gerrit.wikimedia.org/r/349343
[22:54:02] (03CR) 10Ejegg: "thanks thcipriani, just added the conditional" [puppet] - 10https://gerrit.wikimedia.org/r/349343 (owner: 10Ejegg)
[22:56:00] (03CR) 10Krinkle: "If an individual test changes globals we can enable it there with "@backupGlobals enabled" in that test class. But I don't think we should" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349210 (owner: 10Hashar)
[22:56:40] !! now I remember my issue
[22:57:40] notifications keep showing up no matter how many times I mark as read
[23:05:41] (03CR) 10Thcipriani: [C: 031] Ensure mcrypt enabled on integration slaves [puppet] - 10https://gerrit.wikimedia.org/r/349343 (owner: 10Ejegg)
[23:11:59] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 640
[23:21:10] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3200172 (10Mattflaschen-WMF)
[23:22:10] (03PS1) 10Mattflaschen: Force Labs to eqiad, since all the services are there. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349349 (https://phabricator.wikimedia.org/T163514)
[23:25:45] (03PS1) 10Krinkle: mwgrep: If --title is set, don't also require '*.js/.css' [puppet] - 10https://gerrit.wikimedia.org/r/349351
[23:28:52] (03PS1) 10Krinkle: mwgrep: Add --etitle option [puppet] - 10https://gerrit.wikimedia.org/r/349352
[23:52:09] RECOVERY - Check Varnish expiry mailbox lag on cp2024 is OK: OK: expiry mailbox lag is 50
[23:52:59] RECOVERY - Check Varnish expiry mailbox lag on cp2026 is OK: OK: expiry mailbox lag is 7021