[00:01:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[00:06:18] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds
[00:16:18] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[00:35:39] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: puppet fail
[01:05:57] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:14:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[02:27:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[02:32:02] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 08m 15s)
[02:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:18] (CR) Jforrester: "> This can probably be renamed to wg..
and removed from CommonSettings.php now" [mediawiki-config] - https://gerrit.wikimedia.org/r/246703 (owner: Bartosz Dziewoński)
[02:36:53] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-19 02:36:53+00:00
[02:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[02:47:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[02:57:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 2 below the confidence bounds
[02:59:58] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 40s)
[03:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:50] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-19 03:04:49+00:00
[03:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:14:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 1 below the confidence bounds
[04:03:08] (PS2) KartikMistry: Add Debian package for apertium-isl [debs/contenttranslation/apertium-isl] - https://gerrit.wikimedia.org/r/244415 (https://phabricator.wikimedia.org/T114988)
[04:13:48] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[05:32:17] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection refused
[05:33:57] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.002 second response time on port 9042
[06:00:38] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket
timeout after 10 seconds
[06:12:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 55001 bytes in 0.505 second response time
[06:17:27] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:19:09] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 19 06:19:09 UTC 2015 (duration 19m 8s)
[06:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:20:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 55001 bytes in 0.386 second response time
[06:29:09] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:17] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: puppet fail
[06:29:18] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:27] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:30] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:48] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:07] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:31:09] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:29] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:48] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:38] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:17] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:18] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:56:19] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:56:29] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:56:59] (PS1) Muehlenhoff: gadolinium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247191
[06:57:37] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:49] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:23] (PS1) Muehlenhoff: gallium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247192
[06:59:38] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:49] (PS1) Muehlenhoff: graphite1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247193
[07:01:37] (PS1) Muehlenhoff: graphite2001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247194
[07:02:23] (PS1) Muehlenhoff: hafnium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247195
[07:03:08] (PS1) Muehlenhoff: helium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247196
[07:04:02] (PS1) Muehlenhoff: heze: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247197
[07:04:54] (PS1) Muehlenhoff: holmium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247198
[07:06:44] (PS1) Muehlenhoff: hooft: Use the role
keyword [puppet] - https://gerrit.wikimedia.org/r/247199
[07:07:40] (PS1) Muehlenhoff: install2001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247200
[07:08:27] (PS1) Muehlenhoff: iridium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247201
[07:09:19] (PS1) Muehlenhoff: iron: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247202
[07:10:08] (PS1) Muehlenhoff: krypton: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247203
[07:10:51] (PS1) Muehlenhoff: labcontrol2001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247204
[07:11:42] (PS1) Muehlenhoff: labnet1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247205
[07:12:31] (PS1) Muehlenhoff: labnet1002: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247206
[07:17:28] (PS1) Muehlenhoff: labnodepool1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247207
[07:18:09] (PS1) Muehlenhoff: labsdb100[1-3]: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247208
[07:18:49] (PS1) Muehlenhoff: labservices1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247209
[07:20:00] (PS1) Muehlenhoff: lithium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247210
[07:21:01] (PS1) Muehlenhoff: Mark analytics1021 as a spare [puppet] - https://gerrit.wikimedia.org/r/247211
[07:22:26] (PS1) Muehlenhoff: ms-fe*: Fully use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247212
[07:23:42] (PS1) Muehlenhoff: neon: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247213
[07:24:34] (PS1) Muehlenhoff: nitrogen: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247214
[07:25:52] (PS1) Muehlenhoff: openldap: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247215
[07:26:40] (PS1) Muehlenhoff: osm: Use the role keyword
[puppet] - https://gerrit.wikimedia.org/r/247216
[07:27:27] (PS1) Muehlenhoff: palladium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247217
[07:28:07] (PS1) Muehlenhoff: pc100*: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247218
[07:31:00] (PS1) Muehlenhoff: planet: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247219
[07:31:36] (PS1) Muehlenhoff: potassium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247220
[07:32:37] (PS1) Muehlenhoff: silver: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247221
[07:33:36] (PS1) Muehlenhoff: stat1001: Fully use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247222
[07:34:26] (PS1) Muehlenhoff: statistics::cruncher: Move standard and base::firewall includes into the role [puppet] - https://gerrit.wikimedia.org/r/247223
[07:34:58] (PS1) KartikMistry: Apertium: Add new language pairs for Apertium MT [puppet] - https://gerrit.wikimedia.org/r/247224 (https://phabricator.wikimedia.org/T111902)
[07:35:19] (PS1) Muehlenhoff: subra/suhail: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247225
[07:36:10] (PS1) Muehlenhoff: tendril: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247226
[07:37:09] (PS1) Muehlenhoff: terbium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247227
[07:39:02] (PS1) Muehlenhoff: tor: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247228
[07:39:37] (PS1) Muehlenhoff: uranium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247229
[07:40:23] (PS1) Muehlenhoff: Use role keyword for dbstore hosts [puppet] - https://gerrit.wikimedia.org/r/247230
[07:41:07] (PS1) Muehlenhoff: Use the role keyword for analytics::mysql::meta [puppet] - https://gerrit.wikimedia.org/r/247231
[07:43:01] (PS1) Muehlenhoff: Use the role keyword for the major roles on
analytics1027 [puppet] - https://gerrit.wikimedia.org/r/247233
[07:44:13] (PS1) Muehlenhoff: Mark graphite1002 as testsystem [puppet] - https://gerrit.wikimedia.org/r/247234
[07:44:59] (PS1) Muehlenhoff: Move the authdns servers to the role keyword [puppet] - https://gerrit.wikimedia.org/r/247235
[07:46:30] (PS1) Muehlenhoff: smokeping: Don't ensure latest [puppet] - https://gerrit.wikimedia.org/r/247236
[07:47:34] (PS1) Muehlenhoff: Use testsystem role for einsteinium [puppet] - https://gerrit.wikimedia.org/r/247237
[07:48:37] (PS1) Muehlenhoff: Use testsystem role for pybal-test* [puppet] - https://gerrit.wikimedia.org/r/247238
[07:49:45] (PS1) Muehlenhoff: Use testsystem role for ruthenium [puppet] - https://gerrit.wikimedia.org/r/247239
[07:50:24] (PS1) Muehlenhoff: Use testsystem role for virt100[5-7] [puppet] - https://gerrit.wikimedia.org/r/247240
[08:06:23] (PS2) Muehlenhoff: Move base::debdeploy into the base class [puppet] - https://gerrit.wikimedia.org/r/246220
[08:16:57] (CR) Alexandros Kosiaris: [C: -1] "is-sv, not is-ms" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/247224 (https://phabricator.wikimedia.org/T111902) (owner: KartikMistry)
[08:18:31] (PS2) Alexandros Kosiaris: helium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247196 (owner: Muehlenhoff)
[08:18:38] (CR) Alexandros Kosiaris: [C: 2 V: 2] helium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247196 (owner: Muehlenhoff)
[08:19:26] (PS2) Alexandros Kosiaris: heze: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247197 (owner: Muehlenhoff)
[08:19:33] (CR) Alexandros Kosiaris: [C: 2 V: 2] heze: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247197 (owner: Muehlenhoff)
[08:21:04] akosiaris: ok.
Looks I'm sleepy :/
[08:22:00] (PS3) Muehlenhoff: Move base::debdeploy into the base class [puppet] - https://gerrit.wikimedia.org/r/246220
[08:22:40] (CR) Alexandros Kosiaris: [C: -1] "some lines are codfw specific and should stay under codfw hierarchy. The rest should be moved though" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/246992 (owner: Dzahn)
[08:22:55] (PS2) KartikMistry: Apertium: Add new language pairs for Apertium MT [puppet] - https://gerrit.wikimedia.org/r/247224 (https://phabricator.wikimedia.org/T111902)
[08:22:57] (CR) Muehlenhoff: [C: 2 V: 2] Move base::debdeploy into the base class [puppet] - https://gerrit.wikimedia.org/r/246220 (owner: Muehlenhoff)
[08:23:08] (CR) Alexandros Kosiaris: [C: 2] Use testsystem role for pybal-test* [puppet] - https://gerrit.wikimedia.org/r/247238 (owner: Muehlenhoff)
[08:23:12] (PS2) Alexandros Kosiaris: Use testsystem role for pybal-test* [puppet] - https://gerrit.wikimedia.org/r/247238 (owner: Muehlenhoff)
[08:23:32] (CR) Alexandros Kosiaris: [C: 2 V: 2] Use testsystem role for pybal-test* [puppet] - https://gerrit.wikimedia.org/r/247238 (owner: Muehlenhoff)
[08:23:46] kart_: :-)
[08:24:41] (CR) Alexandros Kosiaris: [C: 2] Apertium: Add new language pairs for Apertium MT [puppet] - https://gerrit.wikimedia.org/r/247224 (https://phabricator.wikimedia.org/T111902) (owner: KartikMistry)
[08:24:46] (PS3) Alexandros Kosiaris: Apertium: Add new language pairs for Apertium MT [puppet] - https://gerrit.wikimedia.org/r/247224 (https://phabricator.wikimedia.org/T111902) (owner: KartikMistry)
[08:26:48] (PS2) Muehlenhoff: Make spare role include base::firewall [puppet] - https://gerrit.wikimedia.org/r/246388
[08:27:14] (CR) Alexandros Kosiaris: [C: 2] uranium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247229 (owner: Muehlenhoff)
[08:27:18] (PS2) Alexandros Kosiaris: uranium: Use the role
keyword [puppet] - https://gerrit.wikimedia.org/r/247229 (owner: Muehlenhoff)
[08:28:18] (CR) Alexandros Kosiaris: [C: 2] uranium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247229 (owner: Muehlenhoff)
[08:28:40] (CR) Muehlenhoff: [C: 2 V: 2] Make spare role include base::firewall [puppet] - https://gerrit.wikimedia.org/r/246388 (owner: Muehlenhoff)
[08:28:55] (PS3) Muehlenhoff: Make spare role include base::firewall [puppet] - https://gerrit.wikimedia.org/r/246388
[08:29:07] (CR) Muehlenhoff: [V: 2] Make spare role include base::firewall [puppet] - https://gerrit.wikimedia.org/r/246388 (owner: Muehlenhoff)
[08:29:25] operations, Gerrit: Rename "Dzahn" to "Daniel Zahn" in Gerrit - https://phabricator.wikimedia.org/T113792#1734076 (hashar) For past commits, we can add a `.mailmap` file at the root of the repository. That let you alias an old name/alias with a new pair. Example: https://github.com/git/git/blob/master/....
[08:29:55] kart_: Oct 15 10:38:31 sca1001 kernel: [22547927.888029] apertium-postch[33828]: segfault at 61 ip 00007ff4fd5b21f4 sp 00007fffea81a2f0 error 4 in libapertium3-3.3.so.0.0.0[7ff4fd502000+133000]
[08:30:35] just one up to now but if this occurs again we need to investigate... library segfaults are not nice
[08:34:22] (PS2) Alexandros Kosiaris: smokeping: Don't ensure latest [puppet] - https://gerrit.wikimedia.org/r/247236 (owner: Muehlenhoff)
[08:34:29] (CR) Alexandros Kosiaris: [C: 2 V: 2] smokeping: Don't ensure latest [puppet] - https://gerrit.wikimedia.org/r/247236 (owner: Muehlenhoff)
[08:36:51] (CR) Muehlenhoff: "I was under the impression I had already merged" [puppet] - https://gerrit.wikimedia.org/r/246831 (owner: Muehlenhoff)
[08:38:27] (CR) Alexandros Kosiaris: "tbh, not sure it is worth moving all that into hiera. It's not like we will ever have to override (at least for most of them)."
[puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis)
[08:42:16] (CR) Alexandros Kosiaris: [V: -1] Add Debian package for apertium-isl (1 comment) [debs/contenttranslation/apertium-isl] - https://gerrit.wikimedia.org/r/244415 (https://phabricator.wikimedia.org/T114988) (owner: KartikMistry)
[08:42:42] (PS2) Muehlenhoff: aqs: Include base::firewall in the role [puppet] - https://gerrit.wikimedia.org/r/246330
[08:43:30] (PS1) Alexandros Kosiaris: maps: Move the hiera per hosts file to their rightful place [puppet] - https://gerrit.wikimedia.org/r/247243
[08:43:48] (CR) Muehlenhoff: [C: 2 V: 2] aqs: Include base::firewall in the role [puppet] - https://gerrit.wikimedia.org/r/246330 (owner: Muehlenhoff)
[08:46:11] (PS2) Alexandros Kosiaris: package_builder: Keep environments updated [puppet] - https://gerrit.wikimedia.org/r/247084
[08:46:20] (CR) Alexandros Kosiaris: [C: 2 V: 2] package_builder: Keep environments updated [puppet] - https://gerrit.wikimedia.org/r/247084 (owner: Alexandros Kosiaris)
[08:46:55] (PS2) Alexandros Kosiaris: maps: Move the hiera per hosts file to their rightful place [puppet] - https://gerrit.wikimedia.org/r/247243
[08:47:01] (CR) Alexandros Kosiaris: [C: 2 V: 2] maps: Move the hiera per hosts file to their rightful place [puppet] - https://gerrit.wikimedia.org/r/247243 (owner: Alexandros Kosiaris)
[08:48:28] (PS2) Muehlenhoff: etherpad1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246978
[08:48:41] (PS3) KartikMistry: Add Debian package for apertium-isl [debs/contenttranslation/apertium-isl] - https://gerrit.wikimedia.org/r/244415 (https://phabricator.wikimedia.org/T114988)
[08:48:43] (CR) Muehlenhoff: [C: 2 V: 2] etherpad1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246978 (owner: Muehlenhoff)
[08:50:24] (PS2) KartikMistry: Add Debian package of apertium-isl-eng
[debs/contenttranslation/apertium-isl-eng] - https://gerrit.wikimedia.org/r/244416 (https://phabricator.wikimedia.org/T114988)
[08:51:21] (PS1) Alexandros Kosiaris: package_builder: Mute cron update commands [puppet] - https://gerrit.wikimedia.org/r/247245
[08:52:13] (CR) Alexandros Kosiaris: [C: 2] package_builder: Mute cron update commands [puppet] - https://gerrit.wikimedia.org/r/247245 (owner: Alexandros Kosiaris)
[08:52:36] (PS4) KartikMistry: Add Debian package for apertium-isl [debs/contenttranslation/apertium-isl] - https://gerrit.wikimedia.org/r/244415 (https://phabricator.wikimedia.org/T114988)
[09:19:37] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[09:21:08] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.011 second response time on port 9042
[09:26:27] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[09:38:39] (CR) Hashar: "eswiki is used for content translation testing and should be reopened."
[mediawiki-config] - https://gerrit.wikimedia.org/r/234594 (https://phabricator.wikimedia.org/T109157) (owner: MarcoAurelio)
[09:39:40] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[09:43:37] (PS3) Muehlenhoff: videoscalers: move base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/242180
[09:43:47] (CR) Muehlenhoff: [C: 2 V: 2] videoscalers: move base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/242180 (owner: Muehlenhoff)
[09:44:58] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.014 second response time on port 9042
[09:48:06] (PS3) Muehlenhoff: imagescalers: move base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/243122
[09:50:47] (CR) Muehlenhoff: [C: 2 V: 2] imagescalers: move base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/243122 (owner: Muehlenhoff)
[09:52:03] operations, Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to a machine/machines similar to the canary cluster - https://phabricator.wikimedia.org/T115631#1734290 (hashar) If I remember correctly, we already discussed this back in January 2015 (on the ops list?) and rejected i...
[09:56:57] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:59:31] (PS3) Muehlenhoff: Move base::firewall includes for roles on krypton [puppet] - https://gerrit.wikimedia.org/r/245968
[09:59:55] (CR) Muehlenhoff: [C: 2 V: 2] Move base::firewall includes for roles on krypton [puppet] - https://gerrit.wikimedia.org/r/245968 (owner: Muehlenhoff)
[10:17:09] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:18:31] (PS2) Muehlenhoff: wdqs: Move the ferm rules into the role [puppet] - https://gerrit.wikimedia.org/r/246224
[10:18:42] (CR) Muehlenhoff: [C: 2 V: 2] wdqs: Move the ferm rules into the role [puppet] - https://gerrit.wikimedia.org/r/246224 (owner: Muehlenhoff)
[10:20:39] (CR) DCausse: Refactor monolog handling for kafka logs (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: EBernhardson)
[10:20:52] (PS2) Muehlenhoff: Move base::firewall include into the openldap::corp role [puppet] - https://gerrit.wikimedia.org/r/245972
[10:21:10] (CR) Muehlenhoff: [C: 2 V: 2] Move base::firewall include into the openldap::corp role [puppet] - https://gerrit.wikimedia.org/r/245972 (owner: Muehlenhoff)
[10:25:40] (PS14) DCausse: Refactor monolog handling for kafka logs [mediawiki-config] - https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: EBernhardson)
[10:26:09] (CR) DCausse: Refactor monolog handling for kafka logs (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: EBernhardson)
[10:26:52] (PS2) Muehlenhoff: dnsrecursor: Move base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/244692
[10:28:01] (CR) Muehlenhoff: [C: 2 V: 2] dnsrecursor: Move base::firewall include into
the role [puppet] - https://gerrit.wikimedia.org/r/244692 (owner: Muehlenhoff)
[10:30:36] (CR) DCausse: [C: 1] Refactor monolog handling for kafka logs [mediawiki-config] - https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: EBernhardson)
[10:37:19] (PS1) MarcoAurelio: Modifying logo for anwiki per request [mediawiki-config] - https://gerrit.wikimedia.org/r/247253 (https://phabricator.wikimedia.org/T115841)
[10:41:32] (CR) DCausse: [C: 1] Enable config for all three search clusters, but only write to eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) (owner: EBernhardson)
[10:54:08] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection refused
[10:55:57] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.002 second response time on port 9042
[11:01:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[11:02:35] (CR) Glaisher: "I meant remove $wgForeignUploadTargets = $wmgForeignUploadTargets; from CommonSettings.php and just configure it with wgForeignUploadTarge" [mediawiki-config] - https://gerrit.wikimedia.org/r/246703 (owner: Bartosz Dziewoński)
[11:20:31] (CR) Hashar: contint: install npm/grunt-cli with npm (2 comments) [puppet] - https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: Hashar)
[11:20:45] (PS3) Hashar: contint: install npm/grunt-cli with npm [puppet] - https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903)
[11:23:18] (CR) Muehlenhoff: "Looks good to me (we predominantly use present instead of 'present', though)" [puppet] - https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) (owner: Dzahn)
[11:23:58] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[11:24:22]
(Abandoned) Muehlenhoff: Enable ferm on tin [puppet] - https://gerrit.wikimedia.org/r/240083 (owner: Muehlenhoff)
[11:25:38] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[11:28:00] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/247005 (https://phabricator.wikimedia.org/T115348) (owner: Dzahn)
[11:35:43] (PS1) Muehlenhoff: backup::storage: Move base::firewall include inside the role [puppet] - https://gerrit.wikimedia.org/r/247259
[11:47:39] (PS1) Muehlenhoff: Include base::firewall in the phabricator role [puppet] - https://gerrit.wikimedia.org/r/247260
[11:57:37] (PS1) Muehlenhoff: bromine: Move the base::firewall includes into the roles [puppet] - https://gerrit.wikimedia.org/r/247262
[12:00:54] (PS2) Muehlenhoff: Add salt grains for hadoop workers [puppet] - https://gerrit.wikimedia.org/r/246944
[12:04:10] (CR) Muehlenhoff: [C: 2 V: 2] Add salt grains for hadoop workers [puppet] - https://gerrit.wikimedia.org/r/246944 (owner: Muehlenhoff)
[12:11:40] (CR) Alexandros Kosiaris: [C: 2 V: 2] Add Debian package for apertium-isl [debs/contenttranslation/apertium-isl] - https://gerrit.wikimedia.org/r/244415 (https://phabricator.wikimedia.org/T114988) (owner: KartikMistry)
[12:15:44] (Abandoned) Muehlenhoff: Enable ferm on db1046 [puppet] - https://gerrit.wikimedia.org/r/235430 (owner: Muehlenhoff)
[12:16:55] (PS2) Muehlenhoff: Add salt grains for hadoop master and standby [puppet] - https://gerrit.wikimedia.org/r/246945
[12:22:15] !log uploaded to apt.wikimedia.org trusty-wikimedia: apertium-isl_0.1.0-1
[12:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:24:45] (PS3) Muehlenhoff: Add salt grains for hadoop master and standby [puppet] - https://gerrit.wikimedia.org/r/246945
[12:34:53] (CR) Muehlenhoff: [C: 2 V: 2] Add salt grains for hadoop
master and standby [puppet] - https://gerrit.wikimedia.org/r/246945 (owner: Muehlenhoff)
[12:39:55] (PS2) Muehlenhoff: Add salt grains for gitblit [puppet] - https://gerrit.wikimedia.org/r/246946
[12:44:00] (CR) Alexandros Kosiaris: [C: 2 V: 2] Add Debian package of apertium-isl-eng [debs/contenttranslation/apertium-isl-eng] - https://gerrit.wikimedia.org/r/244416 (https://phabricator.wikimedia.org/T114988) (owner: KartikMistry)
[12:47:13] !log uploaded to apt.wikimedia.org trusty-wikimedia: apertium-isl-eng_0.1.0~r20599-1
[12:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:50:52] (CR) Muehlenhoff: [C: 2 V: 2] Add salt grains for gitblit [puppet] - https://gerrit.wikimedia.org/r/246946 (owner: Muehlenhoff)
[12:52:05] (PS2) Muehlenhoff: Add salt grains for spares [puppet] - https://gerrit.wikimedia.org/r/246947
[12:57:13] (CR) Muehlenhoff: [C: 2 V: 2] Add salt grains for spares [puppet] - https://gerrit.wikimedia.org/r/246947 (owner: Muehlenhoff)
[13:19:08] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: puppet fail
[13:21:13] (PS1) Alexandros Kosiaris: cassandra: allow default instance overrides [puppet] - https://gerrit.wikimedia.org/r/247265
[13:25:43] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1734770 (Elitre) (I've also been getting translatewiki.net discussions' notifications in my Spam folder for a while. I remember bringing this up on IRC but can't remember wha...
[13:26:09] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[13:27:49] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[13:29:43] (PS2) Alexandros Kosiaris: cassandra: allow default instance overrides [puppet] - https://gerrit.wikimedia.org/r/247265
[13:30:20] (CR) jenkins-bot: [V: -1] cassandra: allow default instance overrides [puppet] - https://gerrit.wikimedia.org/r/247265 (owner: Alexandros Kosiaris)
[13:34:09] (PS3) Alexandros Kosiaris: cassandra: allow default instance overrides [puppet] - https://gerrit.wikimedia.org/r/247265
[13:34:53] (CR) jenkins-bot: [V: -1] cassandra: allow default instance overrides [puppet] - https://gerrit.wikimedia.org/r/247265 (owner: Alexandros Kosiaris)
[13:38:01] (PS4) Alexandros Kosiaris: cassandra: allow default instance overrides [puppet] - https://gerrit.wikimedia.org/r/247265
[13:38:39] operations, MediaWiki-extensions-BounceHandler, Patch-For-Review: BounceHandler still HTTP posting to test2.wikipedia.org API in production - https://phabricator.wikimedia.org/T114984#1734777 (Jgreen) >>! In T114984#1720653, @01tonythomas wrote: > @Jgreen can we see if things are working sometime tonig...
[13:38:54] operations, MediaWiki-extensions-BounceHandler, Patch-For-Review: BounceHandler still HTTP posting to test2.wikipedia.org API in production - https://phabricator.wikimedia.org/T114984#1734778 (Jgreen) Open>Resolved p:Triage>Normal
[13:39:06] (PS5) Alexandros Kosiaris: cassandra: allow default instance overrides [puppet] - https://gerrit.wikimedia.org/r/247265
[13:39:13] (CR) Alexandros Kosiaris: [C: 2 V: 2] cassandra: allow default instance overrides [puppet] - https://gerrit.wikimedia.org/r/247265 (owner: Alexandros Kosiaris)
[13:47:40] (CR) Alexandros Kosiaris: [C: 2] backup::storage: Move base::firewall include inside the role [puppet] - https://gerrit.wikimedia.org/r/247259 (owner: Muehlenhoff)
[13:47:45] (PS2) Alexandros Kosiaris: backup::storage: Move base::firewall include inside the role [puppet] - https://gerrit.wikimedia.org/r/247259 (owner: Muehlenhoff)
[13:47:49] (CR) Alexandros Kosiaris: [V: 2] backup::storage: Move base::firewall include inside the role [puppet] - https://gerrit.wikimedia.org/r/247259 (owner: Muehlenhoff)
[13:48:10] (PS2) Muehlenhoff: Add salt grains for aqs [puppet] - https://gerrit.wikimedia.org/r/246948
[13:48:27] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:51:25] (CR) Muehlenhoff: [C: 2 V: 2] Add salt grains for aqs [puppet] - https://gerrit.wikimedia.org/r/246948 (owner: Muehlenhoff)
[13:54:19] (PS2) Muehlenhoff: Add salt grains for releases role [puppet] - https://gerrit.wikimedia.org/r/246949
[13:56:09] (CR) Muehlenhoff: [C: 2 V: 2] Add salt grains for releases role [puppet] - https://gerrit.wikimedia.org/r/246949 (owner: Muehlenhoff)
[14:00:48] (PS2) Muehlenhoff: Add salt grains for restbase canaries [puppet] - https://gerrit.wikimedia.org/r/246950
[14:01:25] (CR) Muehlenhoff: [C: 2 V: 2] Add salt grains for restbase canaries [puppet] -
10https://gerrit.wikimedia.org/r/246950 (owner: 10Muehlenhoff) [14:04:33] (03PS2) 10Muehlenhoff: Add salt grains for restbase [puppet] - 10https://gerrit.wikimedia.org/r/246951 [14:09:08] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for restbase [puppet] - 10https://gerrit.wikimedia.org/r/246951 (owner: 10Muehlenhoff) [14:12:42] (03CR) 10Filippo Giunchedi: [C: 04-1] ms-fe*: Fully use the role keyword (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247212 (owner: 10Muehlenhoff) [14:13:15] (03CR) 10Filippo Giunchedi: [C: 031] Move base::firewall into the syslog role [puppet] - 10https://gerrit.wikimedia.org/r/245959 (owner: 10Muehlenhoff) [14:19:30] (03CR) 10Filippo Giunchedi: [C: 04-1] "unrelated to this change, but I'm not sure why dsh module should know about scap (missing context)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [14:20:19] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection refused [14:22:31] moritzm: ^ [14:22:43] (03PS1) 10Muehlenhoff: Remove explicit includes of role::diamond and role::ntp [puppet] - 10https://gerrit.wikimedia.org/r/247277 [14:22:49] ah no nevermind, unrelated change for salt grains [14:22:51] (03Abandoned) 10Muehlenhoff: ms-fe*: Fully use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247212 (owner: 10Muehlenhoff) [14:23:32] godog: these should indeed be unrelated (and IIRC I saw this earlier the day already) [14:24:09] (03PS2) 10Muehlenhoff: Move base::firewall into the syslog role [puppet] - 10https://gerrit.wikimedia.org/r/245959 [14:24:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move base::firewall into the syslog role [puppet] - 10https://gerrit.wikimedia.org/r/245959 (owner: 10Muehlenhoff) [14:25:27] moritzm: yeah nevermind I misread the change title earlier, taking a look [14:25:37] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.001 second 
response time on port 9042 [14:26:13] (03CR) 10Alexandros Kosiaris: [C: 032] osm: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247216 (owner: 10Muehlenhoff) [14:26:24] (03PS2) 10Alexandros Kosiaris: osm: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247216 (owner: 10Muehlenhoff) [14:26:51] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] osm: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247216 (owner: 10Muehlenhoff) [14:27:21] looks like it just got restarted I think, joal ? [14:28:09] akosiaris: ping me if segfault happens. Seems we need to update hfst/aperitum. I'm checking upstream changelogs. [14:28:42] kart_: ok [14:31:12] (03CR) 10Alexandros Kosiaris: [C: 032] openldap: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247215 (owner: 10Muehlenhoff) [14:32:35] (03CR) 10Ottomata: [C: 031] Use the role keyword for the major roles on analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/247233 (owner: 10Muehlenhoff) [14:32:48] (03CR) 10Ottomata: [C: 031] Use the role keyword for analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/247231 (owner: 10Muehlenhoff) [14:34:46] !log canary deploy (a4c55e40) to restbase1001.eqiad [14:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:30] godog: indeed [14:36:02] godog: I am currently experimenting with data loading --> still a bit too much [14:36:10] godog: Will reduce the load asap [14:37:57] joal: ack! please !log here so it is easy to keep an audit [14:38:22] (03PS2) 10Alexandros Kosiaris: openldap: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247215 (owner: 10Muehlenhoff) [14:38:23] godog: Yes, I have done that on analytics chan for some restart, but not every [14:38:45] Will do in here as well (and be disciplined :) [14:38:49] godog: --^ [14:39:02] joal: hehe ok! thanks! 
[14:44:26] (03CR) 10Alexandros Kosiaris: [C: 032] openldap: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247215 (owner: 10Muehlenhoff) [14:44:38] (03PS2) 10Muehlenhoff: Use the role keyword for analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/247231 [14:45:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use the role keyword for analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/247231 (owner: 10Muehlenhoff) [14:45:14] (03PS3) 10Muehlenhoff: Use the role keyword for analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/247231 [14:45:23] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use the role keyword for analytics::mysql::meta [puppet] - 10https://gerrit.wikimedia.org/r/247231 (owner: 10Muehlenhoff) [14:46:45] (03PS15) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [14:51:56] (03PS2) 10Muehlenhoff: Add salt grains for etcd [puppet] - 10https://gerrit.wikimedia.org/r/246952 [14:52:28] !log disabled puppet on maps-test200{1,2,4}. Debugging cassandra multi-instance setup aftermath. Not to be enabled [14:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:45] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for etcd [puppet] - 10https://gerrit.wikimedia.org/r/246952 (owner: 10Muehlenhoff) [14:56:33] (03PS2) 10Muehlenhoff: Add salt grains for test systems [puppet] - 10https://gerrit.wikimedia.org/r/246953 [14:57:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for test systems [puppet] - 10https://gerrit.wikimedia.org/r/246953 (owner: 10Muehlenhoff) [14:59:16] 6operations, 10ops-codfw: power off Codfw-Cisco Servers - https://phabricator.wikimedia.org/T115372#1734976 (10Papaul) @RobH Do we have a particular wipe disk software that we use? 
[14:59:44] (03PS1) 10Ottomata: Add Varnish reqstats diamond collector for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/247282 (https://phabricator.wikimedia.org/T83580) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151019T1500). [15:00:06] (03CR) 10Filippo Giunchedi: [C: 031] Remove explicit includes of role::diamond and role::ntp [puppet] - 10https://gerrit.wikimedia.org/r/247277 (owner: 10Muehlenhoff) [15:01:11] Well, let's start SWATting. [15:01:56] (03PS2) 10Muehlenhoff: Use the role keyword for the major roles on analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/247233 [15:03:04] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use the role keyword for the major roles on analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/247233 (owner: 10Muehlenhoff) [15:03:33] anomie: the Flow cherry pick fails due to a jscs error which is unrelated https://gerrit.wikimedia.org/r/#/c/247283/ :/ [15:03:41] you probably want to force merge [15:03:58] ugh... [15:04:36] (03PS2) 10Muehlenhoff: Remove explicit includes of role::diamond and role::ntp [puppet] - 10https://gerrit.wikimedia.org/r/247277 [15:04:57] (03CR) 10Ottomata: [C: 031] Mark analytics1021 as a spare [puppet] - 10https://gerrit.wikimedia.org/r/246389 (owner: 10Muehlenhoff) [15:05:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove explicit includes of role::diamond and role::ntp [puppet] - 10https://gerrit.wikimedia.org/r/247277 (owner: 10Muehlenhoff) [15:07:00] 6operations, 6Release-Engineering-Team: Proposal: Reroute all requests from WMF Office IPs to a machine/machines similar to the canary cluster - https://phabricator.wikimedia.org/T115631#1734987 (10faidon) The sets of "people working on the code and/or are willing to do QA" and "people who are connected to the... 
[15:07:49] !log anomie@tin Synchronized php-1.27.0-wmf.2/extensions/Flow/includes/Search/Connection.php: SWAT: Backport [[gerrit:246134]] because the thing it fixed suddenly started breaking unit tests, preventing other merges (duration: 00m 18s) [15:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:36] (03PS2) 10Muehlenhoff: Mark analytics1021 as a spare [puppet] - 10https://gerrit.wikimedia.org/r/246389 [15:08:38] (03PS2) 10Ottomata: Add Varnish reqstats diamond collector for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/247282 (https://phabricator.wikimedia.org/T83580) [15:08:54] (03CR) 10Ottomata: [C: 032 V: 032] Add Varnish reqstats diamond collector for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/247282 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [15:09:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Mark analytics1021 as a spare [puppet] - 10https://gerrit.wikimedia.org/r/246389 (owner: 10Muehlenhoff) [15:09:10] !log anomie@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/includes/Search/Connection.php: SWAT: Backport [[gerrit:246134]] because the thing it fixed suddenly started breaking unit tests, preventing other merges (duration: 00m 18s) [15:09:15] !log enabling varnish reqstats diamond collector on all upload caches [15:09:17] (03PS3) 10Muehlenhoff: Mark analytics1021 as a spare [puppet] - 10https://gerrit.wikimedia.org/r/246389 [15:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:22] bblack ^^^ [15:09:25] just fyi. 
[15:09:38] (03CR) 10Muehlenhoff: [V: 032] Mark analytics1021 as a spare [puppet] - 10https://gerrit.wikimedia.org/r/246389 (owner: 10Muehlenhoff) [15:09:57] Ok, now for the real SWATting [15:10:13] anomie: the job passed test, Zuul is a bit locked due to the force merge but will resume soonish [15:10:27] the tests are fine now, congratus [15:11:41] !log Scheduling icinga downtime for CQL checks on aqs while heavily loading data - joal (me) babysites the jobs - 1 day downtime, will reiterate tomorrow if needed [15:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:05] sorry ops people, I didn't know you were paged by those alerts [15:12:35] !log deploying a4c55e40 to RESTBase [15:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:49] joal: no that didn't page btw, just on icinga [15:13:02] no too bad then [15:13:23] but still :) [15:13:34] i get paged! :) joal is doing it all for my own sanity :) [15:13:45] (03CR) 10Papaul: [C: 031] admin: add new group for datacenter ops [puppet] - 10https://gerrit.wikimedia.org/r/246848 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn) [15:13:55] (03PS2) 10Muehlenhoff: Add salt grains for mxes [puppet] - 10https://gerrit.wikimedia.org/r/246954 [15:13:58] * joal is trying to take care of its admin ! 
[15:14:51] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for mxes [puppet] - 10https://gerrit.wikimedia.org/r/246954 (owner: 10Muehlenhoff) [15:15:09] (03PS1) 10Ottomata: analytics1017 is now also a spare [puppet] - 10https://gerrit.wikimedia.org/r/247285 (https://phabricator.wikimedia.org/T112113) [15:16:11] (03PS2) 10Ottomata: analytics1017 is now also a spare [puppet] - 10https://gerrit.wikimedia.org/r/247285 (https://phabricator.wikimedia.org/T112113) [15:16:15] (03CR) 10Ottomata: [C: 032 V: 032] analytics1017 is now also a spare [puppet] - 10https://gerrit.wikimedia.org/r/247285 (https://phabricator.wikimedia.org/T112113) (owner: 10Ottomata) [15:17:03] MatmaRex: Ping for SWAT [15:17:55] anomie: oh, hi [15:18:37] (for some reason i was thinking it's in 40 minutes, but it was 20 minutes ago) [15:19:46] anomie: the latter two of my patches can't be tested until the next train deployment to Commons, by the way. we've tested them on beta. [15:19:55] ok [15:20:33] ottomata: ok [15:21:40] 6operations, 6Release-Engineering-Team, 15User-greg: Proposal: Reroute all requests from WMF Office IPs to a machine/machines similar to the canary cluster - https://phabricator.wikimedia.org/T115631#1735031 (10greg) 5Open>3declined a:3greg Declining, not worth the time. [15:23:41] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1735035 (10GWicke) A PR adding remote schema support to the nodejs frontend is now available at https://github.com/wikimedia/restevent/pull/1. This means that we can now choose to use lo... [15:24:22] !log anomie@tin Synchronized php-1.27.0-wmf.3/includes/deferred/LinksDeletionUpdate.php: SWAT: Use specified pageId for LinksDeletionUpdate→DeleteLinksJob [[gerrit:247267]] (duration: 00m 17s) [15:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:28] anomie: ^ test please [15:24:38] anomie: Works! 
[15:25:27] :) [15:25:28] !log anomie@tin Synchronized php-1.27.0-wmf.2/includes/deferred/LinksDeletionUpdate.php: SWAT: Use specified pageId for LinksDeletionUpdate→DeleteLinksJob [[gerrit:247268]] (duration: 00m 18s) [15:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:33] anomie: Can't test that one without uploading test files to production wikis, which they tend not to like. [15:26:01] MatmaRex: You're up [15:26:39] right [15:27:59] anomie: Ok, the wmf.2 is now confirmed thanks to people deleting copyvios on Commons. [15:28:27] chasemp: I noticed the phab-git-ssh border-in4 term [15:28:53] chasemp: we'll (hopefully) soon going to announce our eqiad space from at least eqord, possibly codfw as well [15:29:29] chasemp: so our "border" will be wider, and terms will need to be applied to all border-ins to be effective [15:29:56] chasemp: hopefully by that time you guys will be able to use the tool I've been working on... [15:30:03] Ok [15:30:39] !log anomie@tin Synchronized php-1.27.0-wmf.3/extensions/Cite: SWAT: Display 'cite_error_references_duplicate_key' next to the affected ref [[gerrit:247256]] (duration: 00m 18s) [15:30:40] MatmaRex: ^ Test please [15:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:52] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [15:31:55] anomie: appears to work [15:32:13] ok. Now we wait for Jenkins on the wmf.2 version... 
[15:32:46] (it got stuck behind 247190) [15:33:35] (03PS1) 10Alexandros Kosiaris: grafana: Alter dashlist in home page [puppet] - 10https://gerrit.wikimedia.org/r/247290 [15:39:00] !log anomie@tin Synchronized php-1.27.0-wmf.2/extensions/Cite: SWAT: Display 'cite_error_references_duplicate_key' next to the affected ref [[gerrit:247255]] (duration: 00m 18s) [15:39:01] MatmaRex: ^ Test please [15:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:35] (03PS2) 10Anomie: Move ForeignUploadTargets config to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246703 (owner: 10Bartosz Dziewoński) [15:39:50] (03CR) 10Reedy: [C: 031] apache: remove softwarewikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243342 (owner: 10Dzahn) [15:40:22] (03CR) 10Reedy: [C: 031] deactivate webhostingwikipedia.com [dns] - 10https://gerrit.wikimedia.org/r/243970 (owner: 10Dzahn) [15:40:37] (03CR) 10Reedy: [C: 031] apache: remove wikimaps redirects [puppet] - 10https://gerrit.wikimedia.org/r/243348 (owner: 10Dzahn) [15:40:45] (03CR) 10Reedy: [C: 031] apache: remove wikiartpedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243341 (owner: 10Dzahn) [15:40:49] anomie: perfect [15:40:50] MatmaRex: Any reply to the discussion on https://gerrit.wikimedia.org/r/#/c/246703/ [15:40:56] (03CR) 10Reedy: [C: 031] apache: remove wikidisclosure redirects [puppet] - 10https://gerrit.wikimedia.org/r/243347 (owner: 10Dzahn) [15:41:04] (03CR) 10Reedy: [C: 031] apache: remove wikifamily redirects [puppet] - 10https://gerrit.wikimedia.org/r/243345 (owner: 10Dzahn) [15:41:10] (03CR) 10Reedy: [C: 031] apache: remove webhostingwikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243344 (owner: 10Dzahn) [15:41:19] anomie: i know nothing about the stuff [15:41:48] marktraceur: thoughts? 
^ [15:41:56] * marktraceur looks [15:42:21] !log anomie@tin Started scap: SWAT: Add a change tag to cross-wiki uploads [[gerrit:246701]] [15:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:02] (03CR) 10Reedy: [C: 031] apache: remove visualwikipedia redirects [puppet] - 10https://gerrit.wikimedia.org/r/243340 (owner: 10Dzahn) [15:43:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [15:43:04] MatmaRex, anomie: I guess it's very little difference, up to you guys [15:43:28] I followed the example of the other variables I saw in the config when I wrote it [15:43:44] * anomie might not make it to that one today anyway, depending on how long the scap for the i18n update in 246701 takes. [15:43:46] if the way Glaisher proposes works, i can do that, i was just moving code around [15:44:01] what [15:44:12] ah that [15:45:03] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L2022-L2024 is not needed anymore [15:46:01] ok, i'll amend [15:46:56] (03PS3) 10Bartosz Dziewoński: Move ForeignUploadTargets config to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246703 [15:47:05] but don't blame me if it doesn't work ;) [15:48:10] looks okay now [15:48:17] Let's hope that it will work :) [15:48:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds [15:48:36] (03PS2) 10Alex Monk: Modifying logo for anwiki per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247253 (https://phabricator.wikimedia.org/T115841) (owner: 10MarcoAurelio) [15:50:57] (03CR) 10Zfilipin: [C: 031] zuul: Add zuul-test-repo helper script [puppet] - 10https://gerrit.wikimedia.org/r/247031 (owner: 10Legoktm) [15:53:23] (03CR) 10Alex Monk: [C: 031] Modifying logo for anwiki per request 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/247253 (https://phabricator.wikimedia.org/T115841) (owner: 10MarcoAurelio) [15:53:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds [15:55:18] anomie: ping when SWAT is done :) [15:55:50] Can someone of the swat team deploy this too? https://gerrit.wikimedia.org/r/#/c/246709/ [15:55:50] kart_: Once the logmsgbot reports that a SCAP is completed, it's done. [15:56:00] Luke081515: Too late for this morning. [15:56:51] anomie: hm, ok, but can we deploy that later? I t need to be deployed till the 23th october [15:56:57] *it [15:57:12] anomie: Thanks. [15:57:26] Luke081515: You can add it to https://wikitech.wikimedia.org/wiki/Deployments for the evening window, at 23:00 UTC [15:57:33] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:39] (03CR) 10BryanDavis: "> I'm not sure why dsh module should know about scap" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [15:57:39] ok, tahnks [15:58:08] MatmaRex: BTW, there's not going to be time this morning for your last patch. [15:58:45] anomie: yeah. i'll move it to tomorrow [16:03:16] anomie: added: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=195242&oldid=195238 [16:03:53] you missed your name Luke081515 [16:03:59] oh [16:04:49] Krenair: Where I can add it? 
I don't see it [16:04:59] see the example line above [16:07:20] should be fixed now [16:11:22] (03PS2) 10Luke081515: Rename two namespaces at bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) [16:14:31] (03CR) 10Luke081515: "Renaming NS_USER_TALK:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [16:20:28] jouncebot: next [16:20:29] In 3 hour(s) and 39 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151019T2000) [16:21:51] ostriches: wondering if I can do deployment while current scap is going on [16:22:20] ostriches: https://phabricator.wikimedia.org/T112626 [16:22:42] * bd808 looks in topic for on-call opsen and finds that section missing [16:23:05] (03CR) 10Mobrovac: [C: 04-1] "Not ready to go yet." [puppet] - 10https://gerrit.wikimedia.org/r/245887 (https://phabricator.wikimedia.org/T114830) (owner: 10Milimetric) [16:23:17] kart_: Not generally a good idea ;) [16:23:27] yep. So waiting. [16:23:36] It looks I will miss window :/ [16:24:34] Yours should be quick to get deployed... [16:24:46] anomie: How far through scap are you? [16:24:48] mutante: I just wanted to make sure that https://phabricator.wikimedia.org/T115548 was seen so the 3-day or whatever wait period could start [16:24:54] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1735201 (10greg) Just for observation purposes: so far, it seems the only people confirming this on this task are WMF employees, ie, those with their email being filtered for t... 
[16:25:07] Reedy: "Started sync-apaches" is at 97% [16:25:23] 11 left [16:26:29] 10Ops-Access-Requests, 6operations: Requesting access to analytics-privatedata-users for Bryan Davis - https://phabricator.wikimedia.org/T115548#1735202 (10Dzahn) a:3Dzahn [16:27:24] Reedy: Now it's doing "Started scap-rebuild-cdbs" [16:28:27] 10Ops-Access-Requests, 6operations: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1735214 (10TJones) 3NEW [16:28:53] bd808: On the subject of dsh...did you realize we still include it on bast1001? [16:29:18] historical raisons, if I had to guess [16:29:46] ostriches: Oh, I did not but yeah shouldn't effect scap either way [16:30:32] (03PS1) 10Dzahn: admin: add bd808 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) [16:32:20] (03PS1) 10Chad: Remove dsh from bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/247296 [16:33:25] wth, nickserv [16:34:13] (03CR) 10Chad: "Proposed I2dad0011 to remove it from the one non-scap place that still uses it. That'll make it even easier to just move inside scap where" [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [16:35:30] bd808: If we land my change we can just move that dsh crud into the scap module where it really belongs. [16:35:49] (03PS2) 10Dzahn: admin: add bd808 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) [16:36:19] 6operations, 10ops-codfw: power off Codfw-Cisco Servers - https://phabricator.wikimedia.org/T115372#1735263 (10RobH) So the two options I've used in the past are: * Take an Ubuntu or Debian live USB/Disc and boot off it, then run wipe from the command line against each individual disk. * Take a dban (http://w... 
[16:37:27] !log anomie@tin Finished scap: SWAT: Add a change tag to cross-wiki uploads [[gerrit:246701]] (duration: 55m 05s) [16:37:28] Reedy, kart_: ^ Done. Sorry that went over so long. [16:38:21] 10Ops-Access-Requests, 6operations: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1735265 (10Dzahn) a:3Dzahn [16:38:28] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1735266 (10Dzahn) a:3Dzahn [16:41:59] (03CR) 10Dzahn: [C: 032] "ops does not use dsh anymore either. we have removed all dsh groups quite some time ago, except these 3: mediawiki-installation parsoid " [puppet] - 10https://gerrit.wikimedia.org/r/247296 (owner: 10Chad) [16:43:09] < ostriches> bd808: If we land my change we can just move that dsh crud into the scap module where it really belongs. [16:43:12] ^ now you can [16:43:46] anomie: no worries! [16:44:09] mutante: thx! [16:45:40] ostriches: i see all the ancient groups are still on bast1001, even though we killed it all from puppet, was never cleaned up though. i'll wait with that a a few days just in case [16:45:57] sounds good [16:48:48] !log restbase deploy complete [16:53:42] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1735322 (10JKrauska) @Nemo_bis: can you cross reference the 'bad'ness of the emails that's been reporting? [16:57:48] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1735351 (10JohnLewis) [16:58:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[17:01:19] (03CR) 10Chad: [C: 032] Enable Education Program extension at srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [17:01:42] (03Merged) 10jenkins-bot: Enable Education Program extension at srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [17:02:34] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: education program on srwiki (duration: 00m 18s) [17:03:06] !log Ran fix-stats.php on 20 wikipedias: https://phabricator.wikimedia.org/T112626 [17:03:38] (03CR) 10Aaron Schulz: [C: 032] Set page purge limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243363 (owner: 10Aaron Schulz) [17:03:50] wheee [17:04:04] (03Merged) 10jenkins-bot: Set page purge limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243363 (owner: 10Aaron Schulz) [17:04:42] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [17:05:39] AaronSchulz: Could you glance at https://gerrit.wikimedia.org/r/#/c/246281/? [17:05:46] look like log bot is sleeping? [17:06:23] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [17:06:48] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1735435 (10JKrauska) All of the messages in my Spam box have this.. "Why is this message in Spam? It is in violation of Google's recommended email sender guidelines. Learn mo... 
[17:07:27] (03CR) 10Aaron Schulz: [C: 031] Use $wgFlaggedRevsTags instead of $wgFlaggedRevTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246281 (owner: 10Chad) [17:07:44] AaronSchulz: thx [17:08:02] (03CR) 10Chad: [C: 032] Use $wgFlaggedRevsTags instead of $wgFlaggedRevTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246281 (owner: 10Chad) [17:08:24] (03Merged) 10jenkins-bot: Use $wgFlaggedRevsTags instead of $wgFlaggedRevTags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246281 (owner: 10Chad) [17:09:56] !log demon@tin Synchronized wmf-config/flaggedrevs.php: use current config, less logspam (duration: 00m 17s) [17:10:42] !log aaron@tin Synchronized wmf-config/InitialiseSettings.php: Set page purge limiting (duration: 00m 18s) [17:16:10] 6operations, 7Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1735492 (10JKrauska) Hrm. Not sure if this is a big deal, but we don't seem to have forward records for iridium.eqiad.wmnet or iridium.wikimedia.org This address 2620:0:86... [17:25:10] (03PS1) 10EBernhardson: Disable completion suggester experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247303 [17:29:26] 6operations, 10Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1735553 (10Dzahn) This has been done before but a long time ago. It's basically a) check if list is really public b) copy .mbox file from mailman dir to a public dir on web... [17:32:55] 6operations: Document, clean up, and make a policy for dsh groups - https://phabricator.wikimedia.org/T80415#1735569 (10Dzahn) Meanwhile most dsh groups have been deleted and it's not being used by deployment or ops. We just removed it from bast1001 as well for that reason. So i'll decline this one. 
[17:38:02] (PS1) Chad: Move dsh code into scap where it belongs [puppet] - https://gerrit.wikimedia.org/r/247304 [17:39:04] (PS1) Dzahn: dsh: delete remaining group files [puppet] - https://gerrit.wikimedia.org/r/247305 [17:40:57] (CR) JGirault: [C: 1] "Seems fine to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/241226 (https://phabricator.wikimedia.org/T3837) (owner: Smalyshev) [17:41:30] (CR) Chad: [C: -1] "Ummm, stuff uses these..." [puppet] - https://gerrit.wikimedia.org/r/247305 (owner: Dzahn) [17:42:04] (CR) Chad: "scap-test and parsoid can probably go away, but mediawiki-installation definitely not." [puppet] - https://gerrit.wikimedia.org/r/247305 (owner: Dzahn) [17:44:16] (CR) Dzahn: [C: -2] "eh, right. just wishful thinking about mediawiki-installation" [puppet] - https://gerrit.wikimedia.org/r/247305 (owner: Dzahn) [17:44:58] (CR) Dzahn: "but maybe someday the list can be in etcd?" [puppet] - https://gerrit.wikimedia.org/r/247305 (owner: Dzahn) [17:46:16] (CR) Chad: "Yeah I'm not opposed to killing off these in favor of etcd." [puppet] - https://gerrit.wikimedia.org/r/247305 (owner: Dzahn) [17:51:21] operations, Release-Engineering-Team, Scap3: Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899#1735644 (demon) NEW [17:51:29] mutante: ^ :) [17:52:27] :) cool [17:53:39] operations, Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1735659 (JohnLewis) Sent an email to Lars. Also the stats in this ticket are *significantly* out of date. The description states 268 but in reality the number is much much... 
[17:56:03] operations: Document, clean up, and make a policy for dsh groups - https://phabricator.wikimedia.org/T80415#1735669 (Dzahn) Open>declined a:Dzahn decline/resolved, you can interpret it that way or another, but dsh group files are almost all gone [17:56:12] operations, Release-Engineering-Team, Scap3: Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899#1735681 (Krenair) Do we get anything useful out of this? Or is this task just because it's the cool thing to do? [17:56:30] operations, Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1735684 (jcrespo) p:Triage>Low Low because it will probably not block anyone. That doesn't mean it will not be done soon, as it is easy. [17:56:44] ostriches: what do we say about https://phabricator.wikimedia.org/T80395 then ? [17:57:01] i'll add a "blocked by" [17:57:49] operations, Release-Engineering-Team, Scap3: Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899#1735689 (Dzahn) We get to kill the entire dsh module (remove cruft) which is not used anymore except for this. 
[17:58:22] operations, Release-Engineering-Team, Scap3: Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899#1735693 (Dzahn) also, it's kind of a way to resolve T80395 [18:00:17] operations, Release-Engineering-Team, Scap3: Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899#1735706 (Dzahn) [18:00:18] operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#874995 (Dzahn) [18:00:52] operations: Document, clean up, and make a policy for dsh groups - https://phabricator.wikimedia.org/T80415#875247 (Dzahn) [18:00:53] operations: Update dsh node groups from puppet - https://phabricator.wikimedia.org/T80395#1735709 (Dzahn) [18:01:41] operations, Release-Engineering-Team, Scap3: Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899#1735711 (demon) >>! In T115899#1735681, @Krenair wrote: > Do we get anything useful out of this? Or is this task just because it's the cool thing to do? If we've already got t... [18:04:39] operations, Wikimedia-Mailing-lists: Provide mbox archives and add missing lists to Gmane - https://phabricator.wikimedia.org/T59246#1735719 (Dzahn) Yes, i agree, the definition of private vs. public should be based on our script "remove_from_private" that is also for offboarding. Also, haha @ 300 more l... [18:14:45] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1735775 (greg) Thanks @JKrauska! I'm only replying to a subset of your investigation: >>! In T115416#1735435, @JKrauska wrote: > Unsubscribing > A user must be able to unsu... [18:19:01] ebernhardson: any luck in the ES replication? 
[18:20:10] (03PS2) 10Chad: dsh: delete most remaining group files [puppet] - 10https://gerrit.wikimedia.org/r/247305 (owner: 10Dzahn) [18:20:12] (03PS2) 10Chad: Move dsh code into scap where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/247304 [18:20:35] mutante: Amended yours to only delete the 2 unused ones. [18:20:48] Then rebased my change on top to move the last junk into scap module. [18:21:44] ostriches: ok!:) [18:21:52] (03PS1) 10Dzahn: icinga: fix contact name of John Lewis [puppet] - 10https://gerrit.wikimedia.org/r/247322 (https://phabricator.wikimedia.org/T105229) [18:22:56] (03PS2) 10Dzahn: icinga: fix contact name of John Lewis [puppet] - 10https://gerrit.wikimedia.org/r/247322 (https://phabricator.wikimedia.org/T105229) [18:23:09] (03CR) 10John F. Lewis: [C: 031] "if this is breaks it; two wrongs must make a right :)" [puppet] - 10https://gerrit.wikimedia.org/r/247322 (https://phabricator.wikimedia.org/T105229) (owner: 10Dzahn) [18:24:43] (03PS3) 10Dzahn: icinga: fix contact name of John Lewis [puppet] - 10https://gerrit.wikimedia.org/r/247322 (https://phabricator.wikimedia.org/T105229) [18:25:18] (03CR) 10Dzahn: [C: 032] "yea, looked at this with Alex the other day in PR. icinga didn't seem to mind the spaces" [puppet] - 10https://gerrit.wikimedia.org/r/247322 (https://phabricator.wikimedia.org/T105229) (owner: 10Dzahn) [18:25:22] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [18:25:34] yes, that's fixed with the change above [18:25:38] runs puppet on neon [18:25:41] icinga always complains just after you merge the fix :) [18:25:47] exactly :) [18:27:04] how to break icinga [18:27:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[18:28:03] it's not broken, it just doesnt reload until the next puppet run [18:28:42] cache'n'carry [18:29:32] JohnFLewis: ok, applied and finished [18:29:57] (03PS5) 10Chad: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:30:03] mutante: so I can see if I can break icinga? :P [well, do stuff] [18:30:35] JohnFLewis: yes, limited to fermium though [18:30:44] mutante: I know I know :) [18:30:48] try leaving a comment or something [18:31:14] fermium - mailman_queue_size - Successful [18:31:26] un silencing one of the two spam checks [18:31:32] nice! [18:31:48] yea, exactly, those spam checks / queue size [18:31:48] the one we actually care about at least. io is dead really [18:32:00] right, well. maybe [18:32:08] yuvipanda: we got everything merged, and i have the patch up in mediawiki-config: https://gerrit.wikimedia.org/r/#/c/246443/ [18:32:31] JohnFLewis: can i resolve again?:) [18:32:33] ebernhardson: w00t. any idea when it'll go in? [18:32:35] yuvipanda: but since the train didn't roll last week we either need to cherry pick a bunch of things(probably not) or do it thursday...problem is we are getting down to the wire so maybe cherry pick some things [18:32:55] ebernhardson: yeah, +1 to cherry-picking [18:33:02] mutante: sure [18:33:49] ACKNOWLEDGEMENT - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors daniel_zahn Total Errors: 0 - fixed [18:35:17] 10Ops-Access-Requests, 6operations, 7Icinga, 5Patch-For-Review: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1735863 (10Dzahn) 5Open>3Resolved renamed the contact in private repo to match the LDAP "cn" = "John F. Lewis" and adjus... 
[18:35:31] 10Ops-Access-Requests, 6operations, 7Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1735871 (10Dzahn) [18:36:07] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:36:15] 6operations: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1735873 (10Dzahn) [18:40:39] (03PS1) 10Chad: Generate mediawiki-installation dsh group file from hiera data [puppet] - 10https://gerrit.wikimedia.org/r/247324 [18:43:04] (03PS1) 10Aaron Schulz: Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) [18:43:57] PROBLEM - puppet last run on mw2065 is CRITICAL: CRITICAL: puppet fail [18:48:50] yuvipanda: ok, i'll try and figure out today what needs to be pulled forward [18:49:32] ebernhardson: \o/ thanks [18:50:09] yuvipanda: we think we've covered our bases such that breakage won't affect anything else, will find out :) [18:50:25] ebernhardson: :D [18:54:18] * ebernhardson should probably also put together an email about how to turn it off. one line change to mediawiki-config so not too bad :) [18:56:29] mutante: I think things will be much cleaner after the stuff in topic:dsh lands. [18:56:43] ebernhardson: what happens if for example the jobrunners can't connect to nobelium anymore? [18:56:46] for whatever reason?
[18:57:49] yuvipanda: it should retry the job and then throw it away after 10 minutes [18:58:02] coool [18:58:16] yuvipanda: part of the changes we made was to make those per-cluster timeouts, and to make our write job no longer use the standard jobqueue retry (which would hold failed jobs in an abandoned queue for 7 days) [18:58:36] instead it logs to a CirrusSearchChangeFailed channel, and throws it away [18:59:17] nice [18:59:37] that does sound massively less likely to take down the cluster [19:00:27] there is still some worry about graphite...but we can make that log only to fluorine if it's a problem [19:05:25] (03CR) 10Smalyshev: [C: 031] Enable config for all three search clusters, but only write to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) (owner: 10EBernhardson) [19:10:35] 6operations, 7Database, 5Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1735969 (10jcrespo) p:5Triage>3Normal [19:10:57] 6operations, 7Database, 5Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1735970 (10jcrespo) a:3jcrespo [19:12:58] RECOVERY - puppet last run on mw2065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:13:15] (03PS1) 10Chad: beta: Properly point upload cache to proper location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247331 [19:13:17] (03PS1) 10Chad: beta: Start using parsoid cache 04 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247332 [19:14:56] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1735988 (10Ottomata) Hey yalls, I've had requests that we postpone the RFC for this one more week, until Oct 28th. I'd like for one opsen and @ori to be able to attend, and the releva...
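(Editor's sketch of the retry-then-drop behaviour ebernhardson describes at 18:57 to 18:58: retry a failed write for a bounded time, then log the failure to a dedicated channel and discard the job instead of parking it in a long-lived retry queue. This is only an illustration under stated assumptions. The 10-minute limit and the CirrusSearchChangeFailed channel name come from the chat; the function name, its parameters and the retry delay are hypothetical, and the real CirrusSearch job code is not shown in this log.)

```python
import logging
import time

# Channel name taken from the chat; everything else below is hypothetical.
failed_log = logging.getLogger("CirrusSearchChangeFailed")


def run_with_deadline(write, max_age_seconds=600.0, retry_delay=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Retry write() until it succeeds or max_age_seconds has elapsed.

    Returns True on success. Once the deadline has passed, the last error
    is logged and False is returned, so the caller drops the job rather
    than holding it in a retry queue.
    """
    started = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            write()
            return True
        except Exception as exc:
            age = clock() - started
            if age >= max_age_seconds:
                # Give up: record the failure, then let the job be discarded.
                failed_log.warning(
                    "dropping write after %d attempts (%.0fs): %s",
                    attempts, age, exc)
                return False
            sleep(retry_delay)
```

(The clock and sleep parameters exist only so the loop can be exercised without waiting ten real minutes; a caller would normally leave them at their defaults.)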
[19:22:13] (03CR) 10Krinkle: Made the session/main stashes write to both DCs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [19:23:55] (03CR) 10BryanDavis: [C: 04-1] admin: add bd808 to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) (owner: 10Dzahn) [19:25:08] (03PS1) 10Ori.livneh: graphite: set xFilesFactor to 0 [puppet] - 10https://gerrit.wikimedia.org/r/247334 [19:26:31] (03PS2) 10Ori.livneh: grafana: Alter dashlist in home page [puppet] - 10https://gerrit.wikimedia.org/r/247290 (owner: 10Alexandros Kosiaris) [19:26:48] (03CR) 10Ori.livneh: [C: 032 V: 032] grafana: Alter dashlist in home page [puppet] - 10https://gerrit.wikimedia.org/r/247290 (owner: 10Alexandros Kosiaris) [19:29:49] 7Puppet, 10Beta-Cluster-Infrastructure: Puppet failures across all beta caches due to *.wmflabs.org certificate - https://phabricator.wikimedia.org/T115238#1736015 (10thcipriani) [19:30:12] (03PS2) 10Ori.livneh: graphite: set a low xFilesFactor (0.01) [puppet] - 10https://gerrit.wikimedia.org/r/247334 (https://phabricator.wikimedia.org/T114974) [19:31:33] (03PS3) 10Ori.livneh: graphite: set a low xFilesFactor (0.01) [puppet] - 10https://gerrit.wikimedia.org/r/247334 (https://phabricator.wikimedia.org/T114974) [19:31:49] (03CR) 10Ori.livneh: [C: 032 V: 032] graphite: set a low xFilesFactor (0.01) [puppet] - 10https://gerrit.wikimedia.org/r/247334 (https://phabricator.wikimedia.org/T114974) (owner: 10Ori.livneh) [19:34:05] (03CR) 10Chad: [C: 032] beta: Properly point upload cache to proper location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247331 (owner: 10Chad) [19:34:12] (03Merged) 10jenkins-bot: beta: Properly point upload cache to proper location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247331 (owner: 10Chad) [19:34:26] greg-g, any deploy restrictions today? 
(fer maps-test deployments) [19:39:59] greg-g: fyi, brion and I just merged a big change to the video javascripts loading. Since it has some caching implications, it's probably something to keep on the radar for the next train. [19:40:28] It "shouldn't" explode [19:43:47] it shouldn't, but it's TMH, which I sometimes consider worthy reblessing as TNT. :D [19:46:21] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1736061 (10Jgreen) >>! In T97676#1729945, @ellery wrote: > @awight Thanks for double checking, I understand the... [19:46:24] <_joe_> indeed. [19:49:19] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:50:57] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [19:55:29] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:33] 10Ops-Access-Requests, 6operations: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1736089 (10EBernhardson) I believe tjones needs to be added to `analytics-privatedata-users` group in puppet, but IIRC otto mentioned there is something in additio... [19:58:28] 10Ops-Access-Requests, 6operations: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1736097 (10Tfinc) Approved [19:58:48] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151019T2000).
[20:00:22] (03CR) 10Aaron Schulz: Made the session/main stashes write to both DCs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [20:05:50] no mobileapps deploy today [20:10:15] 6operations, 10ops-eqiad, 10netops: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1736119 (10faidon) BGP configuration has been deleted. Interface configuration is still there so that we can see when the link will go down. I'll handle that as the last step when the other two have been compl... [20:12:20] (03PS1) 10Dzahn: mailman: script to dump queue data to HTML [puppet] - 10https://gerrit.wikimedia.org/r/247349 [20:13:13] yurik: no out of the ordinary restrictions :) [20:13:16] thedj: weee [20:16:08] (03PS2) 10Dzahn: mailman: script to dump queue data to HTML [puppet] - 10https://gerrit.wikimedia.org/r/247349 [20:17:04] (03CR) 10Dzahn: [C: 032] mailman: script to dump queue data to HTML [puppet] - 10https://gerrit.wikimedia.org/r/247349 (owner: 10Dzahn) [20:17:11] (03PS2) 10Dzahn: bromine: Move the base::firewall includes into the roles [puppet] - 10https://gerrit.wikimedia.org/r/247262 (owner: 10Muehlenhoff) [20:18:32] (03CR) 10Dzahn: [C: 032] bromine: Move the base::firewall includes into the roles [puppet] - 10https://gerrit.wikimedia.org/r/247262 (owner: 10Muehlenhoff) [20:19:38] (03CR) 10Dzahn: "yep, noop on bromine" [puppet] - 10https://gerrit.wikimedia.org/r/247262 (owner: 10Muehlenhoff) [20:22:10] (03PS3) 10Dzahn: admin: add bd808 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) [20:22:40] (03CR) 10Dzahn: admin: add bd808 to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) (owner: 10Dzahn) [20:23:30] !log cherrypick deploy for parsoid completed: b317f33f and 60a82ae0 cherrypicked from parsoid master 
[20:24:22] is the bot asleep? [20:25:30] (03PS2) 10Aaron Schulz: Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) [20:26:15] anyone know about the bot .. akosiaris .. should i edit the sal page directly? [20:27:02] (03PS1) 10Ori.livneh: grafana: rename 'Dashboard' pane to 'Featured dashboards' [puppet] - 10https://gerrit.wikimedia.org/r/247453 [20:27:45] (03CR) 10Dzahn: [C: 031] "has approval, does not need to be in ops meeting. per waiting period it can merge in a couple hours or tomorrow morning" [puppet] - 10https://gerrit.wikimedia.org/r/247295 (https://phabricator.wikimedia.org/T115548) (owner: 10Dzahn) [20:27:51] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure, 5WMF-NDA: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1736162 (10demon) [20:28:02] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure, 5WMF-NDA: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321787 (10demon) >>! In T100837#1701062, @Krenair wrote: > Can we make this task public? Done. [20:28:11] * subbu will edit the sal wiki page directly [20:29:00] (03CR) 10Ori.livneh: [C: 032] grafana: rename 'Dashboard' pane to 'Featured dashboards' [puppet] - 10https://gerrit.wikimedia.org/r/247453 (owner: 10Ori.livneh) [20:29:24] subbu: in case you haven't seen it: https://tools.wmflabs.org/sal/production [20:29:36] not "official" but so much better than the bot+wikipage [20:30:03] greg-g, oh .. so, then do i add it to the wikipage or not? [20:30:09] yeah [20:30:14] since it's not official (yet) [20:30:27] oh, right. 
[20:30:43] :) [20:30:44] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Bryan Davis - https://phabricator.wikimedia.org/T115548#1736171 (10Dzahn) Alright, so this has approval and this doesn't have to be in ops meeting. Per the waiting period rule it can be merged later to... [20:31:21] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to analytics-privatedata-users for Bryan Davis - https://phabricator.wikimedia.org/T115548#1736172 (10Dzahn) p:5Triage>3Normal [20:34:19] chasemp, yuvipanda: fyi scheduled a deploy window in 1.5 hours (3pm PST) to send out all the multi-dc cherry picks, in case you want to just watch things or whatever [20:35:09] Ok thanks, I would [20:35:13] ebernhardson: cool. I'll be around too! [20:39:17] (03PS3) 10Chad: Log OOM rate and HHVM-non-OOM error rate in statds for graphing [puppet] - 10https://gerrit.wikimedia.org/r/246409 [20:39:28] (03PS1) 10Dzahn: admin: add tjones to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247455 (https://phabricator.wikimedia.org/T115880) [20:40:08] (03CR) 10Chad: [C: 031] "Been running in beta. HHVM data has been getting logged fine.
OOM should be logging fine, but we don't seem to have enough OOM events for " [puppet] - 10https://gerrit.wikimedia.org/r/246409 (owner: 10Chad) [20:41:04] (03CR) 10Dzahn: "ottomata: can you comment on "otto mentioned there is something in addition that needs" [puppet] - 10https://gerrit.wikimedia.org/r/247455 (https://phabricator.wikimedia.org/T115880) (owner: 10Dzahn) [20:41:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1736197 (10Dzahn) p:5Triage>3Normal [20:45:38] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: puppet fail [20:45:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1736222 (10Dzahn) >>! In T115880#1736089, @EBernhardson wrote: > IIRC otto mentioned there is something in addition that needs to be done so h... [20:46:08] 6operations, 10ops-eqiad, 10netops: remove tele2(patchid 2953) from dmarc panel - https://phabricator.wikimedia.org/T115921#1736224 (10RobH) 3NEW a:3Cmjohnson [20:46:26] (03PS1) 10Milimetric: Aggregate from projectviews-*, not projectcounts-* [puppet] - 10https://gerrit.wikimedia.org/r/247458 (https://phabricator.wikimedia.org/T114379) [20:49:15] 6operations, 10ops-eqiad, 10netops: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1736235 (10RobH) I've submitted the de-install request for this link via the equinix portal. It lists all cross-connects in cabinet 0000 for our cage, and the Tele2 link had the cable ID in its information (a... [20:50:53] survey if more surveys [20:52:29] ^ who fancies doing that? 
sounds like a good quarterly goal :p [20:58:22] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1736285 (10Dzahn) Hi @atgo Do you have a [[ https://wikitech.wikimedia.org/wiki/Main_Page | wikitech ]] user yet? If not, please create one there and let me kno... [20:58:52] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1736287 (10Dzahn) p:5Triage>3Normal [20:59:59] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1736288 (10Ottomata) So, we need to be really careful here. This MVP as of yet has zero buy in from anyone in ops. In addition, both @ori and @eevans point out that EventLogging alread... [21:00:36] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1736289 (10atgo) Hi @dzahn - here's my Wikitech acct: https://wikitech.wikimedia.org/wiki/User:Atgomez I'll work on Lisa. [21:01:51] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1736291 (10atgo) @krenair yes I have LDAP - not sure how to find more information. I think my ideal setup/access here should be parallel to what @jkatzwmf has in... [21:06:31] 6operations, 10netops, 10procurement: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1736296 (10faidon) [21:06:35] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1736298 (10Dzahn) @atgo ok, thanks. The wikitech user and the LDAP user are the same here. I have the UID now which i needed to create a patch for this, so that... 
[21:10:09] 6operations, 10ops-eqiad, 10netops: Return psw2-eqiad to spares - https://phabricator.wikimedia.org/T115924#1736309 (10faidon) 3NEW a:3Cmjohnson [21:12:39] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1736320 (10Krenair) [21:12:49] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:13:58] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Puppet has 1 failures [21:21:59] (03PS3) 10Aaron Schulz: Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) [21:24:38] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1736342 (10atgo) Read and signed. I'll attach a new key shortly... I'm on a loaner computer right now since mine is having charging problems. [21:25:23] chasemp: around and working? [21:26:08] Just walked in from being at the applestore, give me 5 minutes? [21:26:14] k : [21:26:15] :) [21:26:23] oh right, how's the laptop situation..? [21:27:51] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1736346 (10EBernhardson) I went back and looked through the patches for adding me to hive, from T109356. Based on the history of patches there... 
[21:30:51] worked out I think, I had some "adverse power event" and all it needed was https://support.apple.com/en-us/HT201295 [21:31:11] assuming it doesn't happen again, they basically said no way to really know at this point [21:41:09] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:43:26] (03PS1) 10Dzahn: admin: create agomez and add to stats-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/247467 (https://phabricator.wikimedia.org/T115666) [21:43:56] (03CR) 10jenkins-bot: [V: 04-1] admin: create agomez and add to stats-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/247467 (https://phabricator.wikimedia.org/T115666) (owner: 10Dzahn) [21:45:25] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1736385 (10Dzahn) ok, cool, i uploaded a change and can amend it anytime with the right key [21:58:52] 6operations: Ferm rules for palladium - https://phabricator.wikimedia.org/T113344#1736445 (10Dzahn) p:5Triage>3Normal [22:00:36] 6operations, 10Wikimedia-General-or-Unknown, 7Database: hewiki's categorylinks shown as not empty though it is; purging does not help - https://phabricator.wikimedia.org/T115682#1736476 (10Dzahn) What is needed from ops here if anything? [22:00:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [22:03:53] mutante: Nothing.. 
I think resolve as dupe per Aaron [22:04:26] Reedy: ok :) [22:04:32] thanks [22:05:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [22:06:09] (03PS2) 10Rush: iridium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247201 (owner: 10Muehlenhoff) [22:06:23] (03CR) 10Rush: [C: 031] "no objections thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/247201 (owner: 10Muehlenhoff) [22:07:53] 6operations: Create an upload queue for reprepro - https://phabricator.wikimedia.org/T115349#1736500 (10Dzahn) p:5Triage>3Normal [22:08:38] (03CR) 10Dzahn: "why ntp and diamond, don't those come from base anyways?" [puppet] - 10https://gerrit.wikimedia.org/r/247201 (owner: 10Muehlenhoff) [22:09:31] !log salt-run deluser --delete-home gmetric; delgroup systemusers [22:10:10] no bot? [22:11:07] hrmm, it's here though [22:12:33] logmsgbot but morebots is not [22:13:00] looks [22:14:51] killed and restarted that on tool labs [22:15:08] but there is no job id for it yet [22:15:32] tools.morebots@tools-bastion-01:~$ jstart -N production-logbot /usr/lib/adminbot/adminlogbot.py --config ./confs/production-logbot.py [22:15:45] per wikitech docs [22:16:17] "qstat" doesn't show it, only the labs- analytics- and qa- bots [22:17:01] tools.morebots@tools-bastion-01:~$ jstart -N production-logbot /usr/lib/adminbot/adminlogbot.py --config ./confs/production-logbot.pyYour job 711019 ("production-logbot") has been submitted [22:17:13] Your job 711019 ("production-logbot") has been submitted [22:17:16] !log salt-run deluser --delete-home gmetric; delgroup systemusers [22:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:17:43] ok [22:17:52] TIL there is a native deluser for salt [22:18:51] (03PS6) 10EBernhardson: Enable config for all three search clusters, but only write to eqiad [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) [22:19:14] chasemp: oh? good for offboarding? [22:19:44] sure probably, I don't know if it's advantageous over salt.cmd and whatever [22:19:49] but it's more readable [22:20:27] !log sync-common on mw1017 to pre-test cirrussearch multi-dc deployment [22:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:24:51] 6operations: Remove user from fr-online list - https://phabricator.wikimedia.org/T115935#1736579 (10Krenair) [22:25:47] chasemp: *nod*, nice [22:25:47] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 4518 bytes in 0.006 second response time [22:26:47] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 4518 bytes in 0.013 second response time [22:28:45] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/Elastica/: Deploy multi-dc cirrusearch code for Elastica extension (duration: 00m 17s) [22:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:28:52] chasemp, yuvipanda|maybe: syncing out the multidc code (still turned off). Will be sending out the config patch in a few [22:29:29] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/CirrusSearch/: Deploy multi-dc cirrusearch code for CirrusSearch extension (duration: 00m 18s) [22:29:35] ebernhardson: which changesets does that include? 
[22:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:30:15] chasemp: https://gerrit.wikimedia.org/r/#/q/owner:self+branch:wmf/1.27.0-wmf.2,n,z all the ones merged after 3pm [22:30:21] gotcha [22:30:31] oops, that says owner:self :) use owner:ebernhardson [22:30:32] well that shows nothing for me :) [22:30:34] ha [22:31:15] actually, hmm i think i need one more sec [22:31:38] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:41] 6operations: Remove user from fr-online list - https://phabricator.wikimedia.org/T115935#1736599 (10Dzahn) p:5Triage>3Normal [22:31:47] 6operations: Remove user from fr-online list - https://phabricator.wikimedia.org/T115935#1736601 (10Dzahn) a:3Dzahn [22:31:57] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [22:32:00] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:24] (03PS1) 10John F. Lewis: mailman: add cron to gather queue data [puppet] - 10https://gerrit.wikimedia.org/r/247472 (https://phabricator.wikimedia.org/T114861) [22:32:51] mutante: ^^ [22:33:46] 6operations, 5Patch-For-Review: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1736607 (10JohnLewis) a:5Dzahn>3JohnLewis Stealing assign. 
[22:36:28] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1736613 (10Dzahn) 3NEW a:3Dzahn [22:36:44] Reedy: ^ that's what you wanted and didnt work last time i tried, right [22:36:53] JohnFLewis: yep, in a minute, thx [22:37:01] 6operations, 6Phabricator, 7audits-data-retention: Enable mod_remoteip on Phabricator and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1736621 (10chasemp) [22:37:52] mutante: ja, if you can get the source repo ideally :D [22:38:23] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1736624 (10Dzahn) [22:39:48] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1736632 (10chasemp) >>! In T100519#1717447, @mmodell wrote: > @chasemp: Is there anything remaining for this to be completed? Feel free to claim and close this task.... [22:40:43] :) [22:40:48] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/CirrusSearch/: Handle ElasticaWrite job failures internally (duration: 00m 18s) [22:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:41:37] (03CR) 10EBernhardson: [C: 032] Enable config for all three search clusters, but only write to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) (owner: 10EBernhardson) [22:41:45] ebernhardson: \i/ [22:41:47] err [22:41:50] \o/ [22:42:05] (03Merged) 10jenkins-bot: Enable config for all three search clusters, but only write to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246443 (https://phabricator.wikimedia.org/T115434) (owner: 10EBernhardson) [22:42:25] 6operations: Remove user from fr-online list - https://phabricator.wikimedia.org/T115935#1736660 (10Dzahn) Hi @bcampbell This is done. 
I removed vshchepakina from the fr-online mail alias. P.S. (bcampbell confirmed per https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=&user=&page=User%3ABCampbe... [22:42:40] 6operations: Remove user from fr-online list - https://phabricator.wikimedia.org/T115935#1736661 (10Dzahn) 5Open>3Resolved [22:42:53] !log ebernhardson@tin Synchronized wmf-config/: Enable cirrusearch multi cluster configuration, only write to eqiad (duration: 00m 18s) [22:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:46:02] ok multi-dc search looks to be ok, although it's not populating anything yet. Our next steps are to copy all the indices from the main cluster to the new cluster (via the newly deployed code), once the indices exist and are populated we can turn normal writes on [22:46:50] 6operations, 5Patch-For-Review: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1736693 (10Dzahn) I wrote a script for this: https://gerrit.wikimedia.org/r/#/c/247349/ that is now used by the cron job John added: https://gerrit.wikimedia.org/r/#/c/247472/1 [22:46:54] (03CR) 10Dzahn: [C: 032] mailman: add cron to gather queue data [puppet] - 10https://gerrit.wikimedia.org/r/247472 (https://phabricator.wikimedia.org/T114861) (owner: 10John F.
Lewis) [22:47:05] 6operations: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1736709 (10Dzahn) [22:47:16] 6operations, 10Wikimedia-Mailing-lists: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1708040 (10Dzahn) [22:47:46] mutante: if you can output the cron committed so I can be happy and head off - that'll be awesome :) [22:48:01] since I can't see root's cron (once puppet is ran of course) [22:49:50] 6operations, 10Wikimedia-Mailing-lists: mailman check_queue recurrent alarm/recovery - https://phabricator.wikimedia.org/T114861#1736717 (10Dzahn) on fermium: Notice: /Stage[main]/Mailman::Cron/Cron[queue_data]/ensure: created # Puppet Name: queue_data 2 * * * * /usr/local/sbin/queue_data -a >> /var/www/qdata... [22:49:52] JohnFLewis: ^ :) [22:50:07] awesome [22:50:14] on that note, night then :) [22:50:28] ok, good night! [22:51:21] (03PS2) 10Dzahn: zuul: Add zuul-test-repo helper script [puppet] - 10https://gerrit.wikimedia.org/r/247031 (owner: 10Legoktm) [22:51:44] (03CR) 10Dzahn: [C: 032] zuul: Add zuul-test-repo helper script [puppet] - 10https://gerrit.wikimedia.org/r/247031 (owner: 10Legoktm) [22:52:43] mutante: thanks :) [22:52:59] PROBLEM - puppet last run on erbium is CRITICAL: CRITICAL: Puppet has 1 failures [22:53:21] legoktm: np [22:53:43] erbium, i dont believe it, but checking [22:54:13] oh, it's real [22:54:30] lol [22:54:35] 6operations, 6Release-Engineering-Team: deployment: user trebuchet gets added and removed from group wikidev on every puppet run - https://phabricator.wikimedia.org/T115760#1736723 (10chasemp) [22:54:46] 6operations, 6Release-Engineering-Team: deployment: user trebuchet gets added and removed from group wikidev on every puppet run - https://phabricator.wikimedia.org/T115760#1732096 (10chasemp) @thcipriani Can we remove trebuchet user from wikidev all together? 
[22:54:48] Could not create user file_mover: Execution of '/usr/sbin/useradd -g 30001 [22:54:58] anyone working on that file_mover thing? [22:55:23] useradd: group '30001' does not exist [22:55:41] uh interesting [22:57:39] why now:) [22:57:40] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [22:57:46] nobody logged in in October [22:59:12] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1736733 (10chasemp) We actually decided on this at the offsite :) This box can go in labs-support and there is no issue w/ current releng permissions translating. It wil... [22:59:48] hmm, did i fail to sync something out? checking [23:00:02] i dont see an erbium related change either [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151019T2300). Please do the needful. [23:00:29] (waiting for ebernhardson) [23:00:30] ok it was me, it wants me to sync out changes to the test dir. I should have realised that [23:01:01] !log ebernhardson@tin Synchronized tests/: noop sync mediawiki-config test dir (duration: 00m 17s) [23:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:01:15] iiuc that should silence the warning [23:01:34] oh, actually, you have patches in this swat ebernhardson [23:01:36] want to take it? [23:01:40] sure, i can do that [23:02:45] Luke081515|away: available to test your patch when i ship it out in SWAT? 
[23:02:50] * ebernhardson is going to guess not, but never know :) [23:03:01] it's a throttle exemption ebernhardson [23:03:29] it looks sane, but i've never seen that bit of code before [23:03:29] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (100586s 100000s) [23:03:47] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1736740 (10chasemp) a:5chasemp>3RobH >>! In T95046#1228381, @RobH wrote: > I'm going to assign this to chase, only while the discussion is pending about the networking... [23:04:16] 6operations, 6Analytics-Engineering: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1736743 (10Dzahn) 3NEW [23:04:40] ebernhardson, you realise that usually the people writing throttle.php patches have no real way to test them right? [23:04:43] 6operations, 6Analytics-Engineering: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1736753 (10Dzahn) [23:05:00] Krenair: as i said i've never seen it before, the answer is no i have no clue what our test infra looks like there :) [23:05:08] 6operations, 6Analytics-Engineering: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1736743 (10Dzahn) @ottomata do you know anything about this? 
[23:05:26] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/WikimediaEvents/: Bump sampling rate of common terms test from 1:1000 to 1:200 (duration: 00m 17s) [23:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:05:41] (03CR) 10EBernhardson: [C: 032] Disable completion suggester experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247303 (owner: 10EBernhardson) [23:05:50] but if you think its fine I'll ship it, and blame you if anything goes wrong ;) [23:05:55] (its probably fine) [23:05:57] ACKNOWLEDGEMENT - puppet last run on erbium is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T115943 [23:06:03] (03Merged) 10jenkins-bot: Disable completion suggester experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247303 (owner: 10EBernhardson) [23:06:24] 6operations, 6Analytics-Engineering: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1736765 (10Dzahn) a:3Ottomata [23:06:30] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:07:23] 6operations, 6Analytics-Backlog: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1736766 (10madhuvishy) [23:07:28] !log ebernhardson@tin Synchronized wmf-config/: Disable cirrus suggester AB test (duration: 00m 17s) [23:07:32] 6operations, 6Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1736767 (10Dzahn) 5Resolved>3Open reopening, it's CRIT again on https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=silver&... 
[23:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:05] ACKNOWLEDGEMENT - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (100850s 100000s) daniel_zahn https://phabricator.wikimedia.org/T101803 [23:08:13] (03CR) 10EBernhardson: [C: 032] Add throttle exception for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246709 (https://phabricator.wikimedia.org/T115632) (owner: 10Luke081515) [23:08:35] (03Merged) 10jenkins-bot: Add throttle exception for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246709 (https://phabricator.wikimedia.org/T115632) (owner: 10Luke081515) [23:08:49] !log restarted apache on mw1231 [23:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:36] !log ebernhardson@tin Synchronized wmf-config/throttle.php: Add throttle exception for dewiki (duration: 00m 17s) [23:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:01] (03PS1) 10EBernhardson: Revert "Enable config for all three search clusters, but only write to eqiad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247476 [23:11:07] (03CR) 10EBernhardson: [C: 032] Revert "Enable config for all three search clusters, but only write to eqiad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247476 (owner: 10EBernhardson) [23:11:18] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.090 second response time [23:11:23] YuviPanda, chasemp: some weird errors coming out only for commonswiki :S reverting the config patch till i know why [23:11:35] :( ok [23:11:37] ok thanks [23:11:38] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 71013 bytes in 0.637 second response time [23:11:40] (03PS2) 10Dzahn: bast4001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246962 (owner: 
10Muehlenhoff) [23:11:44] (03Merged) 10jenkins-bot: Revert "Enable config for all three search clusters, but only write to eqiad" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247476 (owner: 10EBernhardson) [23:11:50] !log restarted hhvm on mw1231 [23:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:25] !log ebernhardson@tin Synchronized wmf-config/: Revert multidc cirrussearch config, seeing unexplained errors on commonswiki (duration: 00m 18s) [23:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:02] (03CR) 10Dzahn: [C: 032] bast4001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246962 (owner: 10Muehlenhoff) [23:13:41] !log ebernhardson@tin Synchronized wmf-config/: resync after touching InitialiseSettings.php to bust caches (duration: 00m 18s) [23:13:44] (03PS2) 10Dzahn: argon: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246960 (owner: 10Muehlenhoff) [23:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:27] 6operations: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1736804 (10Dzahn) p:5Triage>3High [23:14:40] 6operations, 10netops: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10Dzahn) [23:14:49] (03PS1) 10Yurik: testwiki Graphoid to restbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247477 [23:15:08] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1736806 (10Dzahn) p:5Triage>3Normal [23:15:47] 6operations, 6Release-Engineering-Team: deployment: user trebuchet gets added and removed from group wikidev on every puppet run - https://phabricator.wikimedia.org/T115760#1736808 (10chasemp) a:5chasemp>3thcipriani >>! In T115760#1736723, @chasemp wrote: > @thcipriani > > Can we remove trebuchet user f... 
[23:16:29] ebernhardson, are you deploying SWAT? [23:18:46] 6operations, 7HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1736817 (10MZMcBride) Re-opening this for further consideration. A fair bit has changed since 2013, including a strong push for HTTPS/TLS/SSL support across both Wikimedia and the rest of th... [23:19:18] 6operations, 7HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1736819 (10MZMcBride) [23:19:47] 6operations, 7HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1736824 (10MZMcBride) 5declined>3Open [23:20:49] 6operations, 7HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1736826 (10Dzahn) Agreed, let's consider buying that. Adding "traffic" for opinions. [23:21:04] 6operations, 10Traffic, 7HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1736827 (10Dzahn) [23:21:41] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1736829 (10Dzahn) [23:21:43] * yurik pinging ebernhardson - could you deploy https://gerrit.wikimedia.org/r/#/q/247477,n,z as part of this swat? 
I already added it to the swat request [23:21:48] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1691895 (10Dzahn) added "hardware-requests" [23:21:57] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1736834 (10Dzahn) p:5Triage>3Normal [23:22:14] 6operations, 5Patch-For-Review: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#1736835 (10Dzahn) p:5Triage>3Normal [23:22:52] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1736840 (10Dzahn) p:5Triage>3High [23:23:24] 6operations, 10MediaWiki-extensions-BounceHandler: Need an administrative front end for BounceHandler - https://phabricator.wikimedia.org/T114020#1736843 (10Dzahn) p:5Triage>3Normal [23:23:54] 6operations: Booleans in hiera may be harmful - https://phabricator.wikimedia.org/T114018#1736847 (10Dzahn) p:5Triage>3Normal [23:24:28] (03PS1) 10EBernhardson: Revert "Revert "Enable config for all three search clusters, but only write to eqiad"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247478 [23:24:32] (03CR) 10Dzahn: "noop as expected" [puppet] - 10https://gerrit.wikimedia.org/r/246962 (owner: 10Muehlenhoff) [23:24:40] (03CR) 10Dzahn: [C: 032] argon: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246960 (owner: 10Muehlenhoff) [23:25:51] 6operations: How to page when a host is down? - https://phabricator.wikimedia.org/T113834#1736852 (10Dzahn) @andrew Did that answer the question or should the ticket stay open? [23:26:01] 6operations: How to page when a host is down? 
- https://phabricator.wikimedia.org/T113834#1736853 (10Dzahn) p:5Triage>3Normal [23:26:08] 6operations: How to page when a host is down? - https://phabricator.wikimedia.org/T113834#1736854 (10Dzahn) a:3Andrew [23:27:24] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1736857 (10Dzahn) I see enough disk space on snapshot1001 now. Don't know how it was resolved, but it looks it is. [23:27:31] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1736858 (10Dzahn) p:5Triage>3High [23:27:39] 6operations, 10Traffic, 7HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1736860 (10Reedy) Over 2 years later, and we still have pages like status.watchmouse.com giving ``` This server could not prove that it is status.watchmouse.com; its security c... [23:28:29] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1736868 (10Dzahn) 5Open>3Resolved a:3Dzahn Filesystem Size Used Avail Use% Mounted on /dev/sda1 59G 25G 32G 44% / .. dataset1001.wikimedia.org:/data 5... [23:28:32] (03PS1) 10Papaul: Removed mgmt DNS for virt20[0-1][1-9],pc200[1-3],labsdb200[1-3]and WMF5709 Bug:T115372 [dns] - 10https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) [23:28:56] 6operations: scap to snapshot1001 failing due to full disk - https://phabricator.wikimedia.org/T113888#1736872 (10Dzahn) looks like it was reinstalled, yep [23:30:21] (03PS2) 10Alex Monk: Removed mgmt DNS for virt20[0-1][1-9], pc200[1-3], labsdb200[1-3] and WMF5709 [dns] - 10https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) (owner: 10Papaul) [23:30:46] YuviPanda: is there a specific thing that ops needs to do for https://phabricator.wikimedia.org/T113571 ? 
[23:31:10] mutante: no idea, bd808 would know better [23:31:28] YuviPanda: in that case we can probably remove the operations tag [23:31:47] mutante: I don't know how that makes sense, but really, you should be talking to bd808 and not me. [23:31:57] 'Yuvi does not know what operations can do for this' does not imply anything :) [23:32:31] I removed the ops tag [23:32:40] ok, thanks [23:32:58] robh, godog: hey [23:33:08] ? [23:33:16] A few days ago I found tmh2* entries in DNS [23:33:24] for mgmt [23:33:32] But I thought we were not going to use tmh* [23:33:44] I've found https://phabricator.wikimedia.org/T84823 and https://phabricator.wikimedia.org/T84812 [23:34:05] They're old [23:34:06] :P [23:34:08] 6operations: Remove user from fr-online list - https://phabricator.wikimedia.org/T115935#1736882 (10bcampbell) Thank you! -Brendan [23:34:10] (the tickets!) [23:34:12] they were likely made as a mistake then, or likely before the decision was made? [23:34:21] hm, was it actually _joe_ I asked about? [23:34:24] it was in tampa as tmh, and likely simply renamed to tmh2* [23:34:25] about it* [23:34:39] 6operations, 10MediaWiki-extensions-BounceHandler: Need an administrative front end for BounceHandler - https://phabricator.wikimedia.org/T114020#1736884 (10Legoktm) We can send the data to graphite and put up some graphs on grafana. 
[23:34:39] but yea, lets make a task to reclaim to spares, lessee [23:34:44] _joe_ did the reinstalls [23:34:54] robh: I think they want reinstalling under mw numbers [23:35:01] for tmh1* yeah [23:35:05] im making a ticket [23:35:20] well, we may not need that many more apaches in codfw, and i need to ensure they are identical to the mw* there [23:35:33] so either way i'll have to track down a bit more than 5m of info is all, hence task =] [23:35:39] pfft [23:35:42] there are already videoscalers in codfw [23:35:42] i'll note the others became mw* that is useful =] [23:35:44] you should know all this [23:35:45] they're just not called tmh* [23:35:58] Are they just orphaned dns entries? [23:36:01] no [23:36:03] the systems exist [23:36:10] so mgmt exists. [23:36:15] I can't actually ping those mgmt ips? [23:36:15] Reedy: are you trolling me? [23:36:44] robh: the pfft and the "you should know all this" was :P [23:36:49] yep, they were never installed, but they have hostnames assigned to bare metal [23:37:07] so i have to investigate the bare metal for reallocation, hence my making a task ;] [23:37:12] k [23:37:18] getting shit done, yo [23:37:42] bah [23:37:55] well, whoever renamed these [23:38:02] 6operations, 10Traffic, 7Pybal: pybal-related issue on host start can break service IPs... - https://phabricator.wikimedia.org/T113597#1736909 (10Dzahn) p:5Triage>3High [23:38:08] didnt put in a task to get the actual physical labels changed (the eqiad tmh) [23:38:13] >_< [23:39:12] is anyone still swating? 
[23:39:34] 6operations: reclaim tmh2* as spares or into mw* pool - https://phabricator.wikimedia.org/T115950#1736931 (10RobH) 3NEW a:3RobH [23:40:58] 6operations, 10ops-eqiad: relabel tmh1001/mw1259 & tmh1001/tmh1002 - https://phabricator.wikimedia.org/T115952#1736950 (10RobH) 3NEW a:3Cmjohnson [23:41:27] 6operations: reclaim tmh2* as spares or into mw* pool - https://phabricator.wikimedia.org/T115950#1736961 (10Krenair) (See also T84823 and T84812) [23:41:42] is that second ticket able to be closed due to this robh? [23:41:47] yurik: i was, but swat finished quickly [23:41:53] goddamn it i made it a subtask of the wrong task [23:41:55] blargggg [23:42:04] yurik: still need it deployed? the window is open [23:42:06] Krenair: i dunno what second ticket you mean [23:42:10] ebernhardson, yes pls [23:42:14] that I linked [23:42:21] in the comment [23:42:30] (03CR) 10EBernhardson: [C: 032] testwiki Graphoid to restbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247477 (owner: 10Yurik) [23:42:52] (03Merged) 10jenkins-bot: testwiki Graphoid to restbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247477 (owner: 10Yurik) [23:43:28] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: testwiki Graphoid to restbase (duration: 00m 17s) [23:43:32] yurik: ^^ [23:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:45] ebernhardson, awesome! testing... [23:43:50] indeed, declining [23:43:56] since its now linked to the new one for reference [23:43:57] 6operations, 6Services, 5Patch-For-Review, 7RESTBase-architecture: Separate /var on restbase100x - https://phabricator.wikimedia.org/T113714#1736983 (10Dzahn) p:5Triage>3Normal [23:44:02] ebernhardson, if it works, we could deploy it for everyone :)) [23:44:19] although i should probably wait til tomorrow [23:45:20] yep, works well, ebernhardson thx! 
[23:45:35] 6operations, 10ops-eqiad: relabel tmh1001/mw1259 & tmh1002/mw1260 - https://phabricator.wikimedia.org/T115952#1736993 (10Krenair) [23:46:46] 6operations, 10Salt: salt still has issues with grain selection? - https://phabricator.wikimedia.org/T114937#1737000 (10Dzahn) p:5Triage>3Normal [23:46:57] 6operations, 7Icinga: make critical icinga services always send email but keep honoring timezones for pages - https://phabricator.wikimedia.org/T114661#1737007 (10Dzahn) a:3Dzahn [23:47:23] 7Puppet, 6operations, 5Patch-For-Review: Add the puppet CA to the certification authorities trusted by our systems, on demand - https://phabricator.wikimedia.org/T114638#1737015 (10Dzahn) p:5Triage>3Normal [23:47:25] 6operations: videoscaler naming conventions - https://phabricator.wikimedia.org/T105009#1737017 (10Krenair) a:5Krenair>3Joe [23:48:06] 6operations: reclaim tmh2* as spares or into mw* pool - https://phabricator.wikimedia.org/T115950#1737019 (10RobH) p:5Triage>3Normal Indeed, thx for the linking (I've now resolved the second of the two due to link ;) [23:48:32] 6operations, 6Labs, 5Patch-For-Review: Backport python-ldap3 package from Utopic to Precise / Trusty - https://phabricator.wikimedia.org/T101824#1737022 (10Dzahn) p:5Triage>3Normal What's still missing? [23:50:10] 6operations, 6Labs: Backport python-ldap3 package from Utopic to Precise / Trusty - https://phabricator.wikimedia.org/T101824#1737025 (10Krenair) [23:51:39] (03CR) 10Dzahn: [C: 031] "Faidon's concerns have been addressed. 
role spare includes base::firewall now" [puppet] - 10https://gerrit.wikimedia.org/r/246831 (owner: 10Muehlenhoff) [23:51:58] (03PS2) 10Dzahn: Mark multatuli as spare [puppet] - 10https://gerrit.wikimedia.org/r/246831 (owner: 10Muehlenhoff) [23:52:17] (03CR) 10Dzahn: [C: 031] Mark multatuli as spare [puppet] - 10https://gerrit.wikimedia.org/r/246831 (owner: 10Muehlenhoff) [23:53:19] (03PS2) 10Dzahn: ntp: do not 'ensure latest' [puppet] - 10https://gerrit.wikimedia.org/r/247005 (https://phabricator.wikimedia.org/T115348) [23:53:38] 6operations: reclaim tmh2* as spares or into mw* pool - https://phabricator.wikimedia.org/T115950#1737032 (10RobH) These are identical to the other mw* systems in codfw, so it makes sense to simply append these to the end of the mw system range and use them. (I'll create the onsite tasks shortly, as well as the... [23:54:19] (03CR) 10Dzahn: [C: 032] "no surprise upgrades for base stuff like this" [puppet] - 10https://gerrit.wikimedia.org/r/247005 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [23:55:43] (03PS4) 10Dzahn: admin: add new group for datacenter ops [puppet] - 10https://gerrit.wikimedia.org/r/246848 (https://phabricator.wikimedia.org/T115718) [23:56:55] (03CR) 10Dzahn: [C: 032] "adding new empty group only" [puppet] - 10https://gerrit.wikimedia.org/r/246848 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn) [23:58:59] (03CR) 10Dzahn: [C: 04-2] admin: add dc-ops group to role access_new_install [puppet] - 10https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn) [23:59:36] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1737059 (10Dzahn) [23:59:55] 7Blocked-on-Operations, 10Ops-Access-Requests, 6operations, 3Discovery-Maps-Sprint, and 2 others: Kartotherian service logs inaccessible (systemd?) 
and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1737063 (10Dzahn) @akosiaris wanna review https://gerrit.wikimedia.org/r/#/c/244627/ ?