[01:20:41] (03PS1) 10Reedy: Collapse PHP_SAPI conditionals down into one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393355
[02:07:52] (03PS1) 10Andrew Bogott: role::puppetmaster::standalone: allow specification of puppet_major_version [puppet] - 10https://gerrit.wikimedia.org/r/393357
[02:08:13] (03CR) 10jerkins-bot: [V: 04-1] role::puppetmaster::standalone: allow specification of puppet_major_version [puppet] - 10https://gerrit.wikimedia.org/r/393357 (owner: 10Andrew Bogott)
[02:10:24] (03CR) 10Andrew Bogott: [V: 032 C: 032] role::puppetmaster::standalone: allow specification of puppet_major_version [puppet] - 10https://gerrit.wikimedia.org/r/393357 (owner: 10Andrew Bogott)
[02:11:25] 10Operations, 10Performance-Team, 10Traffic: load.php requests taking multiple minutes - https://phabricator.wikimedia.org/T181315#3786326 (10Tgr) (The file had to be deleted because I messed up and made it public. See also T181317. Can provide it on request though.)
[02:11:42] 10Operations, 10Performance-Team, 10Traffic: load.php requests taking multiple minutes - https://phabricator.wikimedia.org/T181315#3786328 (10Tgr)
[02:47:42] (03PS1) 10Andrew Bogott: puppetmaster.erb: allow switching of puppetmaster_rack_path [puppet] - 10https://gerrit.wikimedia.org/r/393358
[03:24:54] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 762.31 seconds
[03:59:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 178.73 seconds
[04:02:41] (03PS13) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[04:03:49] (03CR) 10jerkins-bot: [V: 04-1] $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes)
[04:28:19] (03PS1) 10TerraCodes: update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393360
[04:29:35] (03CR) 10jerkins-bot: [V: 04-1] update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393360 (owner: 10TerraCodes)
[04:40:46] (03PS14) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[04:41:26] (03Abandoned) 10TerraCodes: update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393360 (owner: 10TerraCodes)
[04:41:32] (03CR) 10jerkins-bot: [V: 04-1] $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes)
[04:44:40] (03PS15) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[05:18:24] PROBLEM - Nginx local proxy to apache on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:19:14] RECOVERY - Nginx local proxy to apache on mw2133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.200 second response time
[05:49:14] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3786373 (10bd808) >>! In T180854#3785110, @Qgil wrote: > Could these upgrades via UI make the requirements for maintenace simpler? This typ...
[05:57:27] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3786376 (10bd808) >>! In T180854#3783598, @Qgil wrote: >>>! In T180854#3778284, @bd808 wrote: >> I would also recommend that the deployment...
[06:31:04] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/hphpd/hphpd.ini]
[06:56:04] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:48:54] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3786384 (10alanajjar) @Marostegui online now?
[08:27:31] (03PS1) 10ArielGlenn: abstract recombine job needs to read the gzipped input files [dumps] - 10https://gerrit.wikimedia.org/r/393364
[08:28:30] (03CR) 10ArielGlenn: [C: 032] abstract recombine job needs to read the gzipped input files [dumps] - 10https://gerrit.wikimedia.org/r/393364 (owner: 10ArielGlenn)
[08:29:54] PROBLEM - puppet last run on labtestneutron2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:42] !log ariel@tin Started deploy [dumps/dumps@ec21673]: fix abstracts recombine job
[08:32:44] !log ariel@tin Finished deploy [dumps/dumps@ec21673]: fix abstracts recombine job (duration: 00m 02s)
[08:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:54] RECOVERY - puppet last run on labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:05:53] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3786442 (10Marostegui) No, will need to wait till Monday
[10:06:42] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3786443 (10alanajjar) @Marostegui okay
[10:38:42] (03PS1) 10ArielGlenn: extend command line length for dumps status files tarball creation [puppet] - 10https://gerrit.wikimedia.org/r/393366
[10:40:09] (03CR) 10ArielGlenn: [C: 032] extend command line length for dumps status files tarball creation [puppet] - 10https://gerrit.wikimedia.org/r/393366 (owner: 10ArielGlenn)
[11:17:44] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27
[11:18:55] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py]
[11:20:44] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27
[11:43:55] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[12:18:34] PROBLEM - Apache HTTP on mw2103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:19:24] RECOVERY - Apache HTTP on mw2103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time
[12:46:46] (03Draft2) 10Jayprakash12345: Enable AdvancedSearch in Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393369
[12:47:58] (03PS3) 10Jayprakash12345: Enable AdvancedSearch in Arabic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393369 (https://phabricator.wikimedia.org/T180291)
[13:04:05] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27
[13:05:05] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27
[13:30:24] PROBLEM - HHVM rendering on mw2242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:31:14] RECOVERY - HHVM rendering on mw2242 is OK: HTTP OK: HTTP/1.1 200 OK - 73671 bytes in 0.291 second response time
[13:37:45] 10Operations, 10Graphite: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786600 (10fgiunchedi)
[13:37:54] 10Operations, 10Graphite: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786612 (10fgiunchedi) p:05Triage>03Unbreak!
[13:40:32] !log drop incoming statsd from scb to graphite1001 temporarily - T181333
[13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:41] T181333: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333
[13:41:02] 10Operations, 10Services, 10Graphite: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786616 (10fgiunchedi)
[13:48:44] (03PS1) 10ArielGlenn: don't preserve timestamp of dump statusfiles tarball during rsync [puppet] - 10https://gerrit.wikimedia.org/r/393372
[13:50:03] (03CR) 10ArielGlenn: [C: 032] don't preserve timestamp of dump statusfiles tarball during rsync [puppet] - 10https://gerrit.wikimedia.org/r/393372 (owner: 10ArielGlenn)
[13:51:15] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27
[13:52:15] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] %27https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen%27+%27https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen%27
[13:58:35] 10Operations, 10Services, 10Graphite: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786649 (10fgiunchedi) More spam, from kafka topic for cpjobqueue, note the repeated `retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-prop_retry_change-`...
[14:10:37] !log roll-restart cpjobqueue to alleviate metrics leak - T181333
[14:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:43] T181333: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333
[14:23:14] https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=API+application+servers+eqiad&h=&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name .. looks like ganglia has stopped recording metrics here for over 2 days ..
[14:25:55] 10Operations, 10Services, 10Graphite: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786683 (10fgiunchedi) Looks like cxserver is having the same problem wrt repeated gc metrics ``` 14:24:39.537511 IP scb1003.eqiad.wmnet.50926 > graphite1001.eqiad.wmnet.8125: UDP, length 142...
[14:26:53] !log restart cxserver on scb100[34] - T181333
[14:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:01] T181333: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333
[14:28:48] subbu: yeah ganglia is going away
[14:30:32] ok. where is the equivalent grafana graph that for those?
s/that//
[14:32:31] subbu: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&cluster=appserver&orgId=1
[14:34:04] !log rolling restart of cxserver to alleviate metrics leak - T181333
[14:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:10] T181333: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333
[14:35:06] thx
[14:40:25] (03PS1) 10ArielGlenn: add top level index files to dump status file tarball for rsync [puppet] - 10https://gerrit.wikimedia.org/r/393374
[14:42:58] (03CR) 10ArielGlenn: [C: 032] add top level index files to dump status file tarball for rsync [puppet] - 10https://gerrit.wikimedia.org/r/393374 (owner: 10ArielGlenn)
[14:48:05] PROBLEM - Host mc2026 is DOWN: PING CRITICAL - Packet loss = 100%
[14:49:34] RECOVERY - Host mc2026 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[15:05:16] 10Operations, 10Services, 10Graphite: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786699 (10fgiunchedi) Statsite and network graphs from graphite1001 {F10994913} {F10994912}
[15:08:41] 10Operations, 10Services, 10Graphite: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786700 (10fgiunchedi) Top 20 sent metrics from scb in eqiad: ``` scb1002:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125 | grep -v -e '^U ' -e '^$' | sed -e 's/:.*//' | sort | cut...
[15:32:32] !og restarted statsd-proxy on graphite1001 (died during investigation)
[15:36:49] volans you missed l :)
[15:36:53] L
[15:37:08] oh, my bad...
[15:37:12] !og restarted statsd-proxy on graphite1001 (died during investigation) T181333
[15:37:13] T181333: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333
[15:37:17] !log restarted statsd-proxy on graphite1001 (died during investigation) T181333
[15:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:31] * volans maybe can do it :D
[15:39:00] thanks for spotting it
[15:45:19] !log ppchelko@tin Started deploy [cpjobqueue/deploy@e35aa05]: Rollback. Disable GC metric reporting T181333
[15:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:26] T181333: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333
[15:45:50] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@e35aa05]: Rollback. Disable GC metric reporting T181333 (duration: 00m 31s)
[15:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:11] !log unban statsd traffic from scb on graphite1001 - T181333
[16:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:18] T181333: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333
[16:05:32] !log kartik@tin Started deploy [cxserver/deploy@11aecc9]: Update cxserver to 0c242c0, Pin service-runner to 2.4.2
[16:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:01] !log kartik@tin Finished deploy [cxserver/deploy@11aecc9]: Update cxserver to 0c242c0, Pin service-runner to 2.4.2 (duration: 03m 29s)
[16:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:27] godog: Done. See if that helps.
[16:10:23] kart_: will do! looking ok so far
[16:10:42] cool.
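(Editor's note: the "Top 20 sent metrics" paste in T181333 above is truncated after `| sort | cut...`, so the aggregation step below is an assumption, not the exact command used; this is a minimal sketch of that kind of one-minute statsd capture, assuming it runs on an scb host and that metrics leave as plain UDP datagrams on port 8125.)

```
sudo timeout 1m ngrep -q -W byline . udp dst port 8125 |
  grep -v -e '^U ' -e '^$' |   # drop ngrep's per-packet header lines and blank lines
  sed -e 's/:.*//' |           # keep only the metric name, strip the "value|type" part
  sort | uniq -c | sort -rn | head -20   # rank the 20 most frequently sent metric names
```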
[19:22:50] PROBLEM - MariaDB Slave Lag: s5 on db1051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 492.42 seconds
[19:23:21] Lets see
[19:23:37] It is an vslow slave
[19:23:40] marostegui: I'm here too
[19:23:54] o/
[19:24:01] \o
[19:24:03] haha
[19:24:26] I am going to bet RAID problems
[19:24:36] I do not see traffic shifts
[19:24:39] <_joe_> i am here too ftr
[19:25:21] OK: optimal, 1 logical, 2 physical, WriteBack policy
[19:25:30] raid looks good yeah
[19:27:00] there is one disk with 2000 errors, though
[19:27:10] that is the other typical RAID issue
[19:27:17] increasing?
[19:27:28] bad disk not yet removed from the group
[19:27:31] there are few slow queries on tendril in the last few minutes, probably effect
[19:27:58] yes, increasing
[19:28:07] then that disk is probably the cause
[19:28:27] I would remove the disk, it is out of warranty anyway
[19:28:57] so just dumb raid controller that didn't detect it yet?
[19:29:19] and properly documented: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Disks_about_to_fail
[19:29:27] jynus: which disk?
[19:29:32] Enclosure Device ID: 32
[19:29:32] Slot Number: 3
[19:29:49] Agreed?
[19:29:51] interesting, the predictive failure is for slot 3 and 8
[19:30:36] let's log: 32:3
[19:30:39] PD:1 slot 3 and PD 0 slot 8 have predictive failure counts of 4 both
[19:30:54] 32:3 yes
[19:31:14] This would be it: megacli -PDOffline -PhysDrv \[32:3\] -aALL
[19:31:23] yes
[19:31:32] or '[32:3]'
[19:31:37] at your will
[19:31:39] !log Set 32:3 disk to offline on db1051
[19:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:54] done
[19:32:05] 52406 (564953513s/0x0001/CRIT) - VD 00/0 is now DEGRADED
[19:32:11] lag recovering
[19:32:13] lag reducing
[19:32:16] ah nice
[19:32:32] volans: we will get a normal degraded raid task, right?
[19:32:38] so we don't have to open it :)
[19:32:41] marostegui: yep, give it few minutes
[19:32:45] checking now: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=6&fullscreen&orgId=1&from=now-6h&to=now
[19:32:46] sure sure
[19:32:49] I'm forcing the recheck
[19:33:01] it is going down
[19:33:28] db1051 is getting old
[19:33:32] and so is db1052
[19:33:34] PROBLEM - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[19:33:36] ACKNOWLEDGEMENT - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T181345
[19:33:40] 10Operations, 10ops-eqiad: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T181345#3786844 (10ops-monitoring-bot)
[19:33:41] :)
[19:33:44] there you go ;)
[19:34:03] lag back to 0
[19:34:41] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1051 - https://phabricator.wikimedia.org/T181345#3786849 (10Marostegui) We forced this disk to OFFLINE as it was having errors and it was making db1051 lag. As soon as it was set to OFFLINE the lag started to recover @Cmjohnson can we get it replaced? Thanks!
[19:34:56] it it had been trafic changes, we could have pooled it as 0
[19:35:00] RECOVERY - MariaDB Slave Lag: s5 on db1051 is OK: OK slave_sql_lag Replication lag: 0.43 seconds
[19:35:05] to slow down the whole wiki
[19:35:14] but it was good that it was at 0
[19:35:19] so it didn't affect it
[19:35:47] for much, I assume semisinc kicked in for a while
[19:36:12] so, nothing else to see here
[19:36:39] https://www.youtube.com/watch?v=5NNOrp_83RU
[19:37:06] lol
[19:37:13] * volans back off
[19:38:46] I am off too
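(Editor's note: a minimal sketch of the check-then-offline sequence discussed in the db1051 incident above, assuming the LSI MegaCli utility is installed on the db host. Only the -PDOffline command is quoted verbatim in the channel; the inspection commands are standard MegaCli usage, not pasted from the incident.)

```
# 1. Logical drive state (should report "Optimal" when healthy):
megacli -LDInfo -LAll -aAll
# 2. Per-disk error counters; a disk with a climbing "Media Error Count" or
#    "Predictive Failure Count" is the usual culprit behind sudden replication lag:
megacli -PDList -aAll | grep -E 'Enclosure Device ID|Slot Number|Media Error Count|Predictive Failure Count'
# 3. Force the bad disk (enclosure 32, slot 3 in this case) out of the array so the
#    controller stops waiting on it; the RAID goes Degraded and icinga auto-files a task:
megacli -PDOffline -PhysDrv '[32:3]' -aALL
```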