[00:00:02] 6operations, 10MobileFrontend, 5Patch-For-Review: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891986 (10Legoktm) Purges for new edits should be working now and I confirmed it on my test page. ?... [00:00:07] 6operations, 10MobileFrontend, 5Patch-For-Review: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1891987 (10dr0ptp4kt) Yay, https://en.m.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens is now fresh... [00:04:18] bblack, legoktm, ori thx for the help. disturbance in the force begone [00:06:50] :) [00:23:09] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail [00:25:32] 6operations, 10Fundraising Tech Backlog, 10Fundraising-Backlog, 10Traffic, 5Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1892105 (10awight) [00:48:39] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [00:48:49] RECOVERY - Disk space on restbase1008 is OK: DISK OK [00:49:23] !log restbase1008: removed 5% root reserve from data partition with tune2fs -m 0 /dev/mapper/restbase1008--vg-srv [00:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:01:54] !log entire restbase cluster: removed 5% root reserve from data partition with tune2fs -m 0 /dev/mapper/restbase$NODE--vg-{srv,var} [01:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:14:19] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: puppet fail [01:41:41] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:22:04] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 08m 53s) [02:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Dec 19 02:28:56 UTC 2015 (duration 6m 53s) [02:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:02:31] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 73363 MB (3% inode=99%) [05:21:48] (03PS1) 10EBernhardson: Remove variables for unused experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260176 [05:27:59] https://grafana-admin.wikimedia.org/dashboard/db/old-home [06:30:50] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:20] PROBLEM - puppet last run on analytics1021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:50] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:31] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:21] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:44:59] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: Puppet has 1 failures [06:52:11] PROBLEM - puppet last run on mw2015 is CRITICAL: CRITICAL: puppet fail [06:56:10] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [07:01:40] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is 
currently enabled, last run 21 seconds ago with 0 failures [07:02:21] RECOVERY - puppet last run on analytics1021 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:02:31] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:02:31] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:59] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:29] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [07:13:59] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: puppet fail [07:15:50] RECOVERY - puppet last run on mw2015 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:31:09] RECOVERY - Disk space on restbase1008 is OK: DISK OK [07:37:30] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:16:45] (03CR) 10Awight: [C: 04-1] "Thanks for starting this port!" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [08:24:37] (03CR) 10Awight: "We might be able to avoid any strdup, actually--see https://maxmind.github.io/libmaxminddb/index.html -> "Pointer Values and MMDB_close()"" [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [08:46:54] <_joe_> AaronSchulz: if you feel like it :P It's wildly outdated, I didn't abandon it just because I wanted a reminder [10:00:51] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: puppet fail [10:07:18] !log CI jobs for MediaWiki were broken because of cssjanus dependency. Should be fixed once mw/core https://gerrit.wikimedia.org/r/#/c/260169/ lands [10:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:31] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:35:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 698 [10:36:33] apergos: poke [10:40:09] RECOVERY - check_mysql on db1008 is OK: Uptime: 67451 Threads: 66 Questions: 4230089 Slow queries: 759 Opens: 818 Flush tables: 2 Open tables: 342 Queries per second avg: 62.713 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:42:46] Steinsplitter: yes? [11:40:21] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 73955 MB (3% inode=99%) [13:13:07] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1892741 (10BBlack) >>! In T121922#1891801, @faidon wrote: > The only fix I see for now (bes... 
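A note on the root-reserve removal logged above at 00:49 and 01:01: ext4 reserves 5% of blocks for root by default, and tune2fs -m 0 hands that space back to the data partition. A minimal sketch of checking and clearing the reserve on one node (the device path is the one from the log entry; substitute the correct LV on each host):

    # show the current reserved block count on the data partition
    tune2fs -l /dev/mapper/restbase1008--vg-srv | grep -i 'reserved block count'
    # drop the 5% root reserve so the srv partition can be used in full
    tune2fs -m 0 /dev/mapper/restbase1008--vg-srv
    # confirm the reserve is now zero
    tune2fs -l /dev/mapper/restbase1008--vg-srv | grep -i 'reserved block count'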
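On the libmaxminddb/GeoIP2 port reviewed above (T99226, r253619): a quick way to sanity-check what the new library returns for an address is the mmdblookup tool that ships with libmaxminddb. A sketch, assuming a GeoIP2 Country database at the path shown (both the path and the test IP are placeholders):

    # look up the ISO country code for an address in a GeoIP2 .mmdb database
    mmdblookup --file /usr/share/GeoIP/GeoIP2-Country.mmdb \
               --ip 192.0.2.1 \
               country iso_code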
[13:28:52] (03PS1) 10Dzahn: tor: set family config option [puppet] - 10https://gerrit.wikimedia.org/r/260185 [13:34:18] (03CR) 10Dzahn: "https://atlas.torproject.org/#details/265E5ABBF2E5846443901E878146060148EFEA44" [puppet] - 10https://gerrit.wikimedia.org/r/260015 (owner: 10Faidon Liambotis) [13:48:50] (03PS1) 10Dzahn: snapshot: mv wikidatadumps classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/260186 [13:59:40] (03PS1) 10Dzahn: quarry: use one file per class, autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/260187 [14:13:33] (03PS1) 10Dzahn: openstack: rename queue-server to queue_server [puppet] - 10https://gerrit.wikimedia.org/r/260188 [14:16:04] (03PS2) 10Dzahn: openstack: rename queue-server to queue_server [puppet] - 10https://gerrit.wikimedia.org/r/260188 [14:19:33] (03PS1) 10Dzahn: contint: rename git-daemon to git_daemon [puppet] - 10https://gerrit.wikimedia.org/r/260189 [14:22:19] (03PS1) 10Dzahn: contint: rename publish-console to publish_console [puppet] - 10https://gerrit.wikimedia.org/r/260190 [14:27:20] 6operations, 10MobileFrontend, 5Patch-For-Review: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1892754 (10BBlack) The purge of the mobile cache (of things older than the fix) is still slowly ongo... [14:27:56] (03PS1) 10Dzahn: bacula: rename mysql-bipe to mysql_bpipe [puppet] - 10https://gerrit.wikimedia.org/r/260191 [14:31:36] (03PS1) 10Dzahn: openstack: rename openstack-manager class [puppet] - 10https://gerrit.wikimedia.org/r/260192 [14:39:20] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: puppet fail [14:49:14] (03PS1) 10Dzahn: installserver: rename classes with dash characters [puppet] - 10https://gerrit.wikimedia.org/r/260193 [14:51:09] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:53:18] (03PS2) 10Dzahn: installserver: rename classes with dash characters [puppet] - 10https://gerrit.wikimedia.org/r/260193 [14:56:05] (03PS1) 10Dzahn: logging: rename webrequest-multicast [puppet] - 10https://gerrit.wikimedia.org/r/260194 [14:59:26] (03PS1) 10Dzahn: base: rename standard-packages to standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/260196 [15:00:36] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 340 [15:01:27] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 314 [15:01:33] (03PS1) 10Dzahn: lvs: rename interface-tweaks to interface_tweaks [puppet] - 10https://gerrit.wikimedia.org/r/260198 [15:02:26] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 347 [15:03:37] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [15:03:57] that seems backwards (on puppet-lint's part I mean), to prefer _ to - [15:04:21] given that classes/defines often have to do with hosts, and host naming rules disallow underscores but not dashes [15:04:22] what's going on with dbs ?
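Most of the Dzahn patches above are the mechanical dash-to-underscore renames needed before the puppet-lint exception can be dropped in r260201. A rough way to list what is still left to rename, assuming a checkout of the puppet repo (a sketch; expect a few false positives):

    # find class/define declarations whose names still contain a dash
    grep -rnE '^[[:space:]]*(class|define)[[:space:]]+[a-z0-9_:]*-' manifests/ modules/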
[15:04:28] I donno, but it's codfw [15:04:30] <_joe_> s3 lag [15:04:34] <_joe_> in codfw [15:04:36] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Seconds_Behind_Master: 24 [15:04:43] <_joe_> and there it goes [15:04:47] every freaking saturday [15:04:56] <_joe_> it's just a way to gather us here to cheer each other :P [15:05:46] we could patch icinga to send some kind of happy holidays message with each SMS :P [15:06:49] (03PS1) 10Dzahn: puppet-lint: rm exceptions for dashes in class names [puppet] - 10https://gerrit.wikimedia.org/r/260201 [15:06:55] it already sends literal unicode hearts :) [15:07:02] py did it [15:07:45] (03CR) 10jenkins-bot: [V: 04-1] puppet-lint: rm exceptions for dashes in class names [puppet] - 10https://gerrit.wikimedia.org/r/260201 (owner: 10Dzahn) [15:08:17] (03CR) 10Dzahn: [C: 04-1] "the changes linked in the message would have to be merged first" [puppet] - 10https://gerrit.wikimedia.org/r/260201 (owner: 10Dzahn) [15:08:21] yea, it's every week. cu later [15:09:01] (03PS2) 10Dzahn: puppet-lint: rm exceptions for dashes in class names [puppet] - 10https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645) [15:09:57] (03CR) 10jenkins-bot: [V: 04-1] puppet-lint: rm exceptions for dashes in class names [puppet] - 10https://gerrit.wikimedia.org/r/260201 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [15:10:30] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail [15:11:06] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 324 [15:12:40] Ahhh it's Dallas, ok [15:12:46] (03PS1) 10Dzahn: mw_rc_irc: rename irc-echo to irc_echo [puppet] - 10https://gerrit.wikimedia.org/r/260202 [15:13:50] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:14:30] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [15:14:36] !log krenair@tin Synchronized wmf-config/CommonSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/259611/ - noop for prod, other than making icinga stop complaining (duration: 00m 31s) [15:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:15] (03CR) 10Alex Monk: "please don't forget to merge and sync these in prod, otherwise icinga starts alerting about unmerged patches" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259611 (owner: 10Aaron Schulz) [15:17:37] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 347 [15:18:49] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:19:46] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [15:22:04] (03PS4) 10BBlack: varnish: appropriate t2-fe -> t1-be backend_options [puppet] - 10https://gerrit.wikimedia.org/r/260048 (https://phabricator.wikimedia.org/T121564) [15:23:26] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [15:24:09] <_joe_> [15:25:46] a large set of inserts and updates on s3 in codfw servers...
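For the s3 pages above, the quickest way to tell whether the codfw slaves are genuinely behind (rather than the check flapping) is to ask each of them directly. A sketch only, using the host list from the alerts; in practice the hostnames need the .codfw.wmnet suffix and appropriate client credentials:

    # check replication state and lag on each paging s3 slave in codfw
    for db in db2036 db2043 db2050 db2057; do
        echo "== $db =="
        mysql -h "$db.codfw.wmnet" -e 'SHOW SLAVE STATUS\G' \
            | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Last_.*Error'
    done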
[15:29:35] (03CR) 10BBlack: [C: 032] varnish: appropriate t2-fe -> t1-be backend_options [puppet] - 10https://gerrit.wikimedia.org/r/260048 (https://phabricator.wikimedia.org/T121564) (owner: 10BBlack) [15:31:50] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [15:35:41] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:27] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 371 [16:06:28] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [16:15:41] PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: puppet fail [16:35:08] (03PS1) 10BBlack: Revert "Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends""" [puppet] - 10https://gerrit.wikimedia.org/r/260204 (https://phabricator.wikimedia.org/T121564) [16:35:35] (03CR) 10BBlack: [C: 032 V: 032] Revert "Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends""" [puppet] - 10https://gerrit.wikimedia.org/r/260204 (https://phabricator.wikimedia.org/T121564) (owner: 10BBlack) [16:37:52] so, the above revert-revert-revert is going to cause some random puppetfail critical spam in here, due to a race condition [16:38:10] I think my salt command will minimize it, but either way they're not really critical, they fix themselves on the next run worst case [16:38:16] (and no actual traffic gets affected) [16:41:19] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 1 failures [16:42:49] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [16:42:49] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 1 failures [16:43:09] RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:45:10] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:46:09] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures [16:46:40] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:47:29] 6operations, 10MobileFrontend, 5Patch-For-Review: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1892835 (10BBlack) ^ Above was off by 1h, it's 17:30 UTC when it ends. 
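On the transient puppet failures expected from the revert above: rather than waiting for the next scheduled agent run, the stragglers can be cleared with a batched puppet run over the cache hosts from the salt master. A sketch only, not necessarily the exact salt command referenced above; the 'cp*' target glob is an assumption:

    # re-run the agent on cache hosts in small batches so the race resolves
    # itself without hammering the puppetmaster
    salt -b 10 'cp*' cmd.run 'puppet agent --test'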
[16:48:10] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:00] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:30] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 1 failures [16:50:00] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:50:31] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:50:50] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:51:21] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [16:51:29] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:52:29] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [16:52:50] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:53:00] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures [16:54:01] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:54:21] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:54:49] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:55:00] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:55:19] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:23:38] (03CR) 10Faidon Liambotis: "Daniel, your comment (and amend) are not correct. tor-eqiad-2.wikimedia.org doesn't exist -- tor-eqiad-1.wm.org was used on purpose, as th" [puppet] - 10https://gerrit.wikimedia.org/r/260015 (owner: 10Faidon Liambotis) [17:26:16] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1892853 (10faidon) On IPv4-only clients, geoiplookup.wm.org isn't used at all (the GeoIP co... [17:33:11] 6operations, 10MobileFrontend, 5Patch-For-Review: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly and other mobile pages being served to users - https://phabricator.wikimedia.org/T121594#1892860 (10BBlack) 5Open>3Resolved [17:41:44] (03PS1) 10Hoo man: Add a .bash_profile for myself [puppet] - 10https://gerrit.wikimedia.org/r/260206 [17:56:07] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 316 [17:56:10] (03CR) 10Hoo man: [C: 031] "Makes sense, diff looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/260186 (owner: 10Dzahn) [18:01:08] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 332 [18:01:26] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 331 [18:01:36] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 357 [18:01:50] hmm [18:03:20] yea, do we need to tell jynus or not because codfw [18:03:35] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [18:03:36] it's daytime for him [18:03:44] but now as i say that, it recovers ;P [18:03:45] and there is the recovery.. [18:04:01] it seems like some spike [18:04:05] yes [18:04:16] and it seems to happen at regular intervals [18:04:22] yeah [18:04:38] Well, these alerts page [18:04:47] so he probably already knows [18:04:49] hoo: yes they do [18:05:01] well, it's obviously not traffic affecting [18:05:08] he uses an android app afaik [18:05:27] ok then [18:05:35] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [18:05:55] <_joe_> this is going to make all of us take pages more lightly in the long run [18:05:55] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [18:06:06] <_joe_> apart from being expensive [18:06:07] yes, it's becoming a problem [18:06:20] each page does cost actual money, yep [18:07:13] maybe we should make codfw critical = false then [18:07:14] so, the first time it happened today there were actual com_updates and com_inserts for db2018's family [18:08:23] but now I don't see anything yet [18:09:34] well db1027's family of which db2018 is a slave [18:09:46] so the first time today it was an actual thing that had changed [18:09:48] but now ? [18:09:54] (03PS2) 10Dzahn: snapshot: mv wikidatadumps classes to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/260186 [18:10:17] I am not sure what to do for it [18:12:25] i checked emails to alerts@ and this is a Saturday thing [18:12:35] something happens every 7d? [18:13:06] Dec 12, same thing, just a different cluster [18:13:14] oh, you would think so but it's not the same thing [18:13:25] hmm..ok [18:13:27] so last saturday was broken replication because a replicated statement [18:13:44] tried to touch tables that don't exist in codfw [18:14:03] 2 weeks ago it was a bug with chinese wikipedia, unicode and mariadb [18:14:17] :p heh, ok [18:14:18] but this saturday thing is becoming a nuisance indeed [18:15:59] so, today it's s 3 [18:16:00] s3 [18:16:31] almost all recovered, there is one single icinga warning left now about slave lag. [18:16:41] dbstore2002 [18:17:12] crit, but that host isn't paging [18:17:22] those are anyway intentionally slave lagged IIRC [18:17:25] 24h or something [18:17:26] ok [18:17:27] at least in eqiad [18:18:34] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 314 [18:21:12] ok, we got an increase in jobs queued [18:21:27] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail [18:21:42] not much, some 100 jobs. 
around 12% [18:22:09] scratch that [18:22:12] it's a 50% increase [18:22:15] from 600 to 900 [18:23:34] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 337 [18:23:41] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 340 [18:23:49] refreshLinks [18:24:00] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 370 [18:25:39] akosiaris: hey, can I help somehow? [18:26:41] godog: so, it's not critical as in it's codfw [18:26:49] but something IS happening in eqiad to cause that [18:27:05] so if you feel like debugging, please do help [18:27:46] for now, all I've actually pinpointed is an increase in the job rate [18:27:53] and an increase in refreshLinks jobs [18:28:03] sorry, i have to go to bring because family [18:28:09] eh, go [18:28:20] is it really something in eqiad? dbtree shows db2018 having 0 lag, and the lagged slaves replicate from that [18:29:14] why would refreshLinks cause only a few codfw slaves to lag? :/ [18:29:51] MW refreshLinks jobs shouldn't really be touching any of the codfw db servers [18:30:09] Krenair: you got a point... maybe it's just exacerbating something preexisting on those codfw servers... [18:30:15] no, they are not [18:30:31] has the db traffic encryption been ruled out as a cause? [18:30:42] or the migration anyways [18:31:06] nope [18:31:11] could be that too [18:31:16] that being said, why today [18:31:33] and the slaves ARE catching up eventually [18:31:42] then something happens and they lag behind once more [18:32:00] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 307 [18:32:08] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 307 [18:33:50] good question why today, their master db2018 has been restarted and reconfigured on the 15th [18:34:07] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [18:34:26] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [18:35:01] how do you see slave lag over time in tendril? [18:35:56] godog: replication graph [18:36:02] 3rd graph from the top [18:36:22] gah, thanks, I was blindly scanning for "lag" [18:37:54] so db2018 is lagging something like 1-5 secs occasionally from db1027 [18:38:07] but its own slaves go well over 300 secs [18:38:24] if it was only one slave I'd say hardware error. Disk, network, something [18:38:25] but 3 ? [18:38:37] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [18:39:29] yeah it feels like mysql itself [18:41:21] ok so all of db2018's slaves lag [18:41:37] checked disks on db2018, none have SMART flag [18:43:28] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [18:44:38] oh and no media error counts [18:44:46] I think that rules out the disk subsystem [18:47:24] /etc/my.cnf diff looks like this, https://phabricator.wikimedia.org/P2443 [18:48:19] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:54:42] !log scheduled maintenance of s3 slave lag on db2036, db2043, db2050, db2057 (all of db2018's family that pages) to effectively silence pages while debugging. 
Check is flapping since 15:00 UTC today [18:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:09] looking at puppet there's https://gerrit.wikimedia.org/r/#/c/259222/ heh, not sure if it could be relevant [18:55:12] but that does not make the problem go away obviously [18:56:06] indeed, but thanks! [18:56:06] godog: m4... we got problem's with s3 [18:56:15] problems* [18:56:27] yeah unrelated, nevermind [18:58:32] akosiaris: I have to go, I'm pageable of course if need be! [18:58:43] godog: yeah ok, thanks for stopping by [18:58:57] np, sorry I couldn't help more heh [19:09:22] so, they're already in warning https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=slave+lag%3A+s3 [19:23:29] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 74185 MB (3% inode=99%) [19:25:58] ACKNOWLEDGEMENT - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 73260 MB (3% inode=99%): gwicke Keeping an eye on this; compactions are catching up with very limited space. If 10G left, free with nodetool stop -- COMPACTION. See https://phabricator.wikimedia.org/T121535. [19:49:27] !log killed gmond on db2036. it was clearly misbehaving and running since Jan 02. db2036 was not listed on the ganglia web interface. killing the orphaned process and restarting seems to have fixed it [19:49:31] chasemp: ^ [19:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:48] I might not have fixed the mysqls yet but at least I fixed this [19:49:58] btw seems like some innodb locks or something [19:50:55] Huh, interesting [20:34:28] RECOVERY - Disk space on restbase1008 is OK: DISK OK [20:48:46] !log restbase1004: `systemctl mask cassandra` in preparation for the decommission finishing [20:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:47:49] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:56] PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:53:39] PROBLEM - zotero on sca1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:55] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.008 second response time [21:55:01] <_joe_> !log restarted zotero on sca1001, various OOM messages [21:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:38] RECOVERY - zotero on sca1001 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.008 second response time [21:55:42] _joe_: just tried the same, but don't have permissions to restart it [21:55:48] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [21:56:45] <_joe_> gwicke: it's handled [21:57:03] yup, thanks [21:57:49] RECOVERY - zotero on sca1002 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.012 second response time [21:57:49] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [22:31:30] did hovercards break? [22:31:58] there's a bug for it matanya [22:32:31] https://phabricator.wikimedia.org/T121777 [22:32:34] thanks Krenair [23:04:09] PROBLEM - RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
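On the job-queue jump discussed above (from roughly 600 to 900 queued jobs, mostly refreshLinks): the total queue size is exposed through the API, and a per-type breakdown can be pulled on a maintenance host. Both lines are sketches; the --group option of showJobs.php is assumed from memory, and the wiki is only an example:

    # total queued jobs for one wiki, from the public API
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' | grep -o '"jobs":[0-9]*'
    # per-type counts (e.g. refreshLinks), via the maintenance wrapper
    mwscript showJobs.php --wiki=enwiki --group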
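The scheduled maintenance logged at 18:54 above can be set from the icinga web UI, or scripted through icinga's external command file. A sketch of the command-file route for one host, assuming the stock Debian command-file path and a two-hour fixed window (times are Unix epoch seconds; the author and comment fields are placeholders):

    # schedule 2h of downtime for the s3 slave-lag check on db2036
    now=$(date +%s); end=$((now + 7200))
    printf '[%s] SCHEDULE_SVC_DOWNTIME;db2036;MariaDB Slave Lag: s3;%s;%s;1;0;7200;ops;silence while debugging\n' \
        "$now" "$now" "$end" > /var/lib/icinga/rw/icinga.cmd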
[23:18:29] PROBLEM - very high load average likely xfs on ms-be2019 is CRITICAL: CRITICAL - load average: 121.91, 100.17, 63.14 [23:22:20] PROBLEM - swift-object-server on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:29] PROBLEM - swift-object-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:29] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:39] PROBLEM - SSH on ms-be2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:48] PROBLEM - swift-container-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:48] PROBLEM - salt-minion processes on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:58] PROBLEM - swift-account-reaper on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:22:58] PROBLEM - swift-account-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:18] PROBLEM - swift-account-server on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:29] PROBLEM - swift-container-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:58] PROBLEM - swift-container-server on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:58] PROBLEM - dhclient process on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:59] PROBLEM - swift-object-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:59] PROBLEM - swift-account-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:23:59] PROBLEM - swift-object-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:24:08] PROBLEM - swift-container-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:32:38] PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: puppet fail [23:46:38] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [23:50:28] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:56:09] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [23:59:58] RECOVERY - puppet last run on db2062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
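The ms-be2019 alerts above (the RAID check timing out, then a load average over 100 flagged as "likely xfs") are the usual signature of I/O requests wedging in the kernel and swift workers piling up in uninterruptible sleep. If the box still responds at all, a quick sketch for confirming that before deciding between failing a disk out and power-cycling:

    # load average and processes stuck in uninterruptible (D) sleep
    cat /proc/loadavg
    ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
    # recent kernel complaints about xfs or the RAID controller
    dmesg | tail -n 100 | grep -iE 'xfs|megaraid|sd[a-z]'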